# Looking for Lateral Movement

----

*Lateral movement* is a cyberattack pattern that describes how an adversary leverages a single foothold to compromise other systems within a network.
Identifying and stopping lateral movement is an important step in controlling the damage from a breach, and also plays a role in forensic analysis of a cyberattack, helping to identify its source and reconstruct what happened.
In this notebook, we show how xGT can be used to find evidence of these types of patterns hiding in large data.

This notebook uses the collection of cyberattack patterns described in the [MITRE ATT&CK Catalog](https://attack.mitre.org/) as a guide to search for evidence of lateral movement within an enterprise network.

For data, we'll be using the [LANL Unified Host and Network Dataset](http://datasets.trovares.com/cyber/LANL/index.html), a set of netflow and host event data collected on an internal Los Alamos National Lab network.

----
## RDP Hijacking

There are 17 *lateral movement* techniques presented in the MITRE ATT&CK Catalog.
We will consider the *RDP Hijacking* technique presented as [technique T1076](https://attack.mitre.org/techniques/T1076/).

RDP hijacking is actually a family of attacks that differ in how the attacker attains the
privileges required to perform the hijack.
The attack broadly looks like this:

1. Lateral movement starts from a foothold where an adversary already has gained access. We'll call this host `A`.

1. The attacker uses some *privilege escalation* technique to attain SYSTEM privilege.

1. The attacker then leverages their SYSTEM privilege to *hijack* an RDP session to
[move through a network](https://doublepulsar.com/rdp-hijacking-how-to-hijack-rds-and-remoteapp-sessions-transparently-to-move-through-an-da2a1e73a5f6).
As a result, the attacker becomes logged in to the other system that the RDP session was connected to.  We'll call this host `B`.

This hijacking action can be repeated to form longer chains of lateral movement, and these chains
can be represented as graph patterns:

![rdp_hijack](images/lateral-movement.png)


----
## Privilege Escalation

The MITRE ATT&CK Catalog contains 28 different techniques for performing privilege escalation.
For our example, we will look for evidence of RDP Hijacking where privilege escalation was carried out using 
a technique called *Accessibility Features* described as [T1015](https://attack.mitre.org/techniques/T1015/).

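The astute reader will note that we are looking for only one of 476 (or more) combinations of techniques: each of the 17 lateral movement techniques can be paired with any of the 28 privilege escalation techniques.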
Each of the others might result in different graph patterns and different queries, but can all be addressed
using the same approach described here.

----
## Mapping to a cyber dataset

In order to formulate a query, we need to understand the content and structure of our
graph.
We will work under the assumption that we have both *netflow* and *Windows server log* event information.

Mapping each of the adversary steps (the yellow-colored circles in the diagram) to our dataset:

1. "Accessibility Features (*privilege escalation*)": An adversary modifies the way programs are launched 
to get a back door into a system.  The following programs can be used for this purpose:
    1. `sethc.exe`
    1. `utilman.exe`

1. "RDP Session Hijack":  Once an adversary finds a session to hijack, they can run this command:  `c:\windows\system32\tscon.exe [session number to be stolen]`.  We look in our graph for Windows log events showing the running of the `tscon.exe` program.

1. "RDP/RDS Netflow": Logging in to system `B` will leave one or more netflow packets from system `A` to `B` that use the RDP port.
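
As a rough illustration of the three mappings above, the sketch below labels individual records with the attack step they could be evidence of. The record layout (plain dicts with a `kind` field) and the `classify` helper are hypothetical and are not part of xGT or the LANL data; port 3389 is the standard RDP port.

```python
# Hypothetical record layout: plain dicts standing in for host-log and netflow rows.
PRIV_ESC_PROGRAMS = {"sethc.exe", "utilman.exe"}
HIJACK_PROGRAM = "tscon.exe"
RDP_PORT = 3389  # standard RDP port

def classify(record):
    """Return the attack step a record could be evidence of, or None."""
    if record["kind"] == "process_start":
        name = record["processName"].lower()
        if name in PRIV_ESC_PROGRAMS:
            return "privilege escalation"
        if name == HIJACK_PROGRAM:
            return "rdp session hijack"
    elif record["kind"] == "netflow" and record["dstPort"] == RDP_PORT:
        return "rdp netflow"
    return None

records = [
    {"kind": "process_start", "host": "A", "processName": "sethc.exe"},
    {"kind": "process_start", "host": "A", "processName": "tscon.exe"},
    {"kind": "netflow", "src": "A", "dst": "B", "dstPort": 3389},
    {"kind": "netflow", "src": "A", "dst": "B", "dstPort": 443},
]
labels = [classify(r) for r in records]
print(labels)
```

A single labeled record proves little on its own; the graph query later in the notebook is what ties these events together by host and by time.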


## Mapping to the LANL dataset

Once we understand the pattern we want to find, we need to determine what specifically to look for in the dataset.

We first need to understand that the LANL dataset has been modified from its raw form.
For example, the anonymization process replaced many of the program names with arbitrary strings such as `Prog123456.exe`.  Also, the program arguments (such as a `/network` option) are not recorded.

Given this lack of information, we will emulate a search for the RDP Hijacking lateral movement behavior by picking some actual values present in the LANL data as proxies for the programs of interest, such as `sethc.exe`.  Here are the mappings:

 - In steps 1 and 4, we will use the string `Proc336322.exe` as a proxy for the `sethc.exe` program and the string `Proc695356.exe` as a proxy for the `utilman.exe` program.
 - In steps 2 and 5, we will use the string `Proc249569.exe` as a proxy for the `tscon.exe` program.
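
One way to keep these proxy names in a single place is a small lookup table that generates the `processName` filter for a query edge. This is only a sketch: `PROXIES` and `process_filter` are hypothetical helper names, and the notebook itself writes the filters out by hand.

```python
# Proxy names taken from the mappings listed above.
PROXIES = {
    "sethc.exe": "Proc336322.exe",
    "utilman.exe": "Proc695356.exe",
    "tscon.exe": "Proc249569.exe",
}

def process_filter(edge_name, programs):
    """Build a parenthesized OR-filter on processName for one query edge."""
    clauses = ['{}.processName = "{}"'.format(edge_name, PROXIES[p])
               for p in programs]
    return "(" + " OR ".join(clauses) + ")"

priv_esc_filter = process_filter("privEsc1", ["sethc.exe", "utilman.exe"])
hijack_filter = process_filter("hijack1", ["tscon.exe"])
print(priv_esc_filter)
print(hijack_filter)
```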


----
## Using xGT to perform this search

The rest of this notebook demonstrates how to take this LANL data and the search pattern description to do these steps:
  1. Ingest the cyber data into xGT
  2. Search for all occurrences of this pattern.

In [1]:
import xgt
conn = xgt.Connection()
conn

<xgt.connection.Connection at 0x1093c6e10>

## Establish Graph Component Schemas

We first try to retrieve the graph component schemas from the xGT server.
If that fails, we create an empty component (vertex or edge frame) for the missing component.

In [2]:
try:
  devices = conn.get_vertex_frame('Devices')
except xgt.XgtNameError:
  devices = conn.create_vertex_frame(
              name='Devices',
              schema=[['device', xgt.TEXT]],
              key='device')
devices

<xgt.graph.VertexFrame at 0x109fcc690>

In [3]:
try:
  netflow = conn.get_edge_frame('Netflow')
except xgt.XgtNameError:
  netflow = conn.create_edge_frame(
            name='Netflow',
            schema=[['epochtime', xgt.INT],
                    ['duration', xgt.INT],
                    ['srcDevice', xgt.TEXT],
                    ['dstDevice', xgt.TEXT],
                    ['protocol', xgt.INT],
                    ['srcPort', xgt.INT],
                    ['dstPort', xgt.INT],
                    ['srcPackets', xgt.INT],
                    ['dstPackets', xgt.INT],
                    ['srcBytes', xgt.INT],
                    ['dstBytes', xgt.INT]],
            source=devices,
            target=devices,
            source_key='srcDevice',
            target_key='dstDevice')
netflow

<xgt.graph.EdgeFrame at 0x109fcc750>

**Edges:** The LANL dataset contains two types of data: netflow and host events. Of the host events recorded, some describe events within a device (e.g., reboots), and some describe events between devices (e.g., login attempts). The lateral movement query in this notebook uses only the netflow data and the in-device events. We call the in-device events "one-sided", since we describe them as graph edges from one vertex to itself.
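
To make the "one-sided" idea concrete, here is a minimal sketch in plain Python, independent of xGT. The sample rows are invented; only the field names follow the `Events1v` frame defined in the next cell.

```python
# Invented host event rows; field names follow the Events1v schema.
events = [
    {"epochtime": 100, "eventID": 4688, "logHost": "Comp1", "processName": "Proc1.exe"},
    {"epochtime": 105, "eventID": 4688, "logHost": "Comp2", "processName": "Proc2.exe"},
]

def to_self_loop(event):
    """Represent an in-device event as an edge (source, target, properties)."""
    host = event["logHost"]
    return (host, host, event)  # source and target are the same vertex

edges = [to_self_loop(e) for e in events]
for source, target, props in edges:
    print(source, "->", target, props["processName"])
```

This is the same effect the frame definition achieves by using `logHost` as both the source key and the target key.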

In [4]:
try:
  events1v = conn.get_edge_frame('Events1v')
except xgt.XgtNameError:
  events1v = conn.create_edge_frame(
           name='Events1v',
           schema=[['epochtime', xgt.INT],
                   ['eventID', xgt.INT],
                   ['logHost', xgt.TEXT],
                   ['userName', xgt.TEXT],
                   ['domainName', xgt.TEXT],
                   ['logonID', xgt.INT],
                   ['processName', xgt.TEXT],
                   ['processID', xgt.INT],
                   ['parentProcessName', xgt.TEXT],
                   ['parentProcessID', xgt.INT]],
           source=devices,
           target=devices,
           source_key='logHost',
           target_key='logHost')
events1v

<xgt.graph.EdgeFrame at 0x109fcc710>

In [5]:
try:
  events2v = conn.get_edge_frame('Events2v')
except xgt.XgtNameError:
  events2v = conn.create_edge_frame(
           name='Events2v',
           schema = [['epochtime',xgt.INT],
                     ['eventID',xgt.INT],
                     ['logHost',xgt.TEXT],
                     ['logonType',xgt.INT],
                     ['logonTypeDescription',xgt.TEXT],
                     ['userName',xgt.TEXT],
                     ['domainName',xgt.TEXT],
                     ['logonID',xgt.INT],
                     ['subjectUserName',xgt.TEXT],
                     ['subjectDomainName',xgt.TEXT],
                     ['subjectLogonID',xgt.TEXT],
                     ['status',xgt.TEXT],
                     ['src',xgt.TEXT],
                     ['serviceName',xgt.TEXT],
                     ['destination',xgt.TEXT],
                     ['authenticationPackage',xgt.TEXT],
                     ['failureReason',xgt.TEXT],
                     ['processName',xgt.TEXT],
                     ['processID',xgt.INT],
                     ['parentProcessName',xgt.TEXT],
                     ['parentProcessID',xgt.INT]],
            source = 'Devices',
            target = 'Devices',
            source_key = 'src',
            target_key = 'logHost')
events2v

<xgt.graph.EdgeFrame at 0x109fd9fd0>

In [6]:
# Utility to print the sizes of data currently in xGT
def print_data_summary():
  print('Devices (vertices): {:,}'.format(devices.num_vertices))
  print('Netflow (edges): {:,}'.format(netflow.num_edges))
  print('Host event 1-vertex (edges): {:,}'.format(events1v.num_edges))
  print('Host event 2-vertex (edges): {:,}'.format(events2v.num_edges))
  print('Total (edges): {:,}'.format(
      netflow.num_edges + events1v.num_edges + events2v.num_edges))
    
print_data_summary()

Devices (vertices): 0
Netflow (edges): 0
Host event 1-vertex (edges): 0
Host event 2-vertex (edges): 0
Total (edges): 0


## Load the data

This section may be skipped if you are connecting to an xGT server that already has the LANL data loaded.

**Load the 1-sided host event data:**

In [7]:
%%time
if events1v.num_edges == 0:
    urls = ["https://datasets.trovares.com/LANL/xgt/wls_day-85_1v.csv"]
    # urls = ["xgtd://wls_day-{:02d}_1v.csv".format(_) for _ in range(2,91)]
    events1v.load(urls)
    print_data_summary()

Devices (vertices): 10,324
Netflow (edges): 0
Host event 1-vertex (edges): 18,637,483
Host event 2-vertex (edges): 0
Total (edges): 18,637,483
CPU times: user 9.28 ms, sys: 7.46 ms, total: 16.7 ms
Wall time: 38.4 s


**Load the 2-sided host event data:**

In [8]:
%%time
if events2v.num_edges == 0:
    urls = ["https://datasets.trovares.com/LANL/xgt/wls_day-85_2v.csv"]
    # urls = ["xgtd://wls_day-{:02d}_2v.csv".format(_) for _ in range(2,91)]
    events2v.load(urls)
    print_data_summary()

Devices (vertices): 12,140
Netflow (edges): 0
Host event 1-vertex (edges): 18,637,483
Host event 2-vertex (edges): 47,790,045
Total (edges): 66,427,528
CPU times: user 21.2 ms, sys: 22.6 ms, total: 43.8 ms
Wall time: 1min 55s


**Load the netflow data:**

In [9]:
%%time
if netflow.num_edges == 0:
    urls = ["https://datasets.trovares.com/LANL/xgt/nf_day-85.csv"]
    #urls = ["xgtd://nf_day-{:02d}.csv".format(_) for _ in range(2,91)]
    netflow.load(urls)
    print_data_summary()

Devices (vertices): 137,705
Netflow (edges): 235,661,328
Host event 1-vertex (edges): 18,637,483
Host event 2-vertex (edges): 47,790,045
Total (edges): 302,088,856
CPU times: user 55.7 ms, sys: 69.7 ms, total: 125 ms
Wall time: 5min 48s


## End of "load the data"

----

Now define some useful functions and get on with the querying ...

In [10]:
# Utility function to launch queries and show job number:
#   The job number may be useful if a long-running job needs
#   to be canceled.

def run_query(query, table_name = "answers", drop_answer_table=True, show_query=False):
    if drop_answer_table:
        conn.drop_frame(table_name)
    if query[-1] != '\n':
        query += '\n'
    query += 'INTO {}'.format(table_name)
    if show_query:
        print("Query:\n" + query)
    job = conn.schedule_job(query)
    print("Launched job {}".format(job.id))
    conn.wait_for_job(job)
    table = conn.get_table_frame(table_name)
    return table
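
The string handling inside `run_query` can be tried on its own, without a server. The helper name `with_into_clause` is hypothetical; it mirrors only the step that appends the `INTO` clause and uses `endswith`, which also tolerates an empty query string.

```python
def with_into_clause(query, table_name="answers"):
    """Ensure the query ends with a newline, then append the INTO clause."""
    if not query.endswith('\n'):
        query += '\n'
    return query + 'INTO {}'.format(table_name)

example = with_into_clause("MATCH ()-[e:Netflow]->()\nRETURN count(*)")
print(example)
```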

## Pulling out only RDP netflow edges

Because of the way the netflow is represented, there may be some netflow edges in the *forward* direction where the `dstPort` field indicates RDP (`dstPort = 3389`), and other edges in the *reverse* direction where the `srcPort` field contains 3389.

The following section of code pulls out all forward RDP edges and drops them into a new edge frame.
It then pulls out all reverse RDP edges, reverses the appropriate fields (i.e., swapping `dst` and `src` versions of the attribute values), and adds these reversed RDP edges to the new edge frame.

Note that the edges in this new edge frame connect up with the same set of vertices as the netflow edges.
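
The same normalization can be sketched on plain Python dicts, which may help when checking the logic by hand. The function name `to_forward_rdp` and the sample flows are made up for illustration; the field names follow the `Netflow` schema.

```python
RDP_PORT = 3389

def to_forward_rdp(flow):
    """Return the flow oriented toward the RDP port, or None if it is not RDP."""
    if flow["dstPort"] == RDP_PORT:
        return dict(flow)  # already forward: copy verbatim
    if flow["srcPort"] == RDP_PORT:
        swapped = dict(flow)
        # Swap every src/dst pair so that dstPort ends up as 3389.
        for a, b in [("srcDevice", "dstDevice"), ("srcPort", "dstPort"),
                     ("srcPackets", "dstPackets"), ("srcBytes", "dstBytes")]:
            swapped[a], swapped[b] = flow[b], flow[a]
        return swapped
    return None

flows = [
    {"epochtime": 10, "srcDevice": "A", "dstDevice": "B", "srcPort": 50000,
     "dstPort": 3389, "srcPackets": 5, "dstPackets": 7, "srcBytes": 500, "dstBytes": 900},
    {"epochtime": 11, "srcDevice": "B", "dstDevice": "A", "srcPort": 3389,
     "dstPort": 50000, "srcPackets": 7, "dstPackets": 5, "srcBytes": 900, "dstBytes": 500},
    {"epochtime": 12, "srcDevice": "A", "dstDevice": "C", "srcPort": 50001,
     "dstPort": 443, "srcPackets": 1, "dstPackets": 1, "srcBytes": 60, "dstBytes": 60},
]
rdp_flows = [f for f in map(to_forward_rdp, flows) if f is not None]
for f in rdp_flows:
    print(f["srcDevice"], "->", f["dstDevice"], f["dstPort"])
```

One difference from the queries that follow: a flow with port 3389 on both ends is kept once here, whereas the forward and reverse queries would each match it.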

We first generate a new edge frame we call `RDPflow` that has the exact same schema as the netflow edge frame.

In [11]:
# Generate a new edge frame for holding only the RDP edges
conn.drop_frame('RDPflow')
rdpflow = conn.create_edge_frame(
            name='RDPflow',
            schema=netflow.schema,
            source=devices,
            target=devices,
            source_key='srcDevice',
            target_key='dstDevice')
rdpflow

<xgt.graph.EdgeFrame at 0x109fccc10>

### Extract forward RDP edges

A "forward" edge is one where the `dstPort = 3389`.
This edge is copied verbatim to the `RDPflow` edge frame.

In [12]:
%%time
q = """
MATCH ()-[edge:Netflow]->()
WHERE edge.dstPort=3389
MERGE (v0: Devices { device : edge.srcDevice })
MERGE (v1: Devices { device : edge.dstDevice })
CREATE (v0)-[e:RDPflow {epochtime : edge.epochtime,
  duration : edge.duration, protocol : edge.protocol,
  srcPort : edge.srcPort, dstPort : edge.dstPort,
  srcPackets : edge.srcPackets, dstPackets : edge.dstPackets,
  srcBytes : edge.srcBytes, dstBytes : edge.dstBytes}]->(v1)
RETURN count(*)
"""
data = run_query(q)
print('Number of answers: {:,}'.format(data.get_data()[0][0]))

Launched job 8
Number of answers: 31
CPU times: user 9.28 ms, sys: 6.63 ms, total: 15.9 ms
Wall time: 5.03 s


### Extract reverse RDP edges

A "reverse" edge is one where the `srcPort = 3389`.
These edges are copied to the `RDPflow` edge frame but **reversed** in transit.
The reversal process involves swapping the: `srcDevice` and `dstDevice`;
`srcPort` and `dstPort`; `srcPackets` and `dstPackets`; and `srcBytes` and `dstBytes`.

In [13]:
%%time
q = """
MATCH ()-[edge:Netflow]->()
WHERE edge.srcPort=3389
MERGE (v0: Devices { device : edge.srcDevice })
MERGE (v1: Devices { device : edge.dstDevice })
CREATE (v1)-[e:RDPflow {epochtime : edge.epochtime,
  duration : edge.duration, protocol : edge.protocol,
  srcPort : edge.dstPort, dstPort : edge.srcPort,
  srcPackets : edge.dstPackets, dstPackets : edge.srcPackets,
  srcBytes : edge.dstBytes, dstBytes : edge.srcBytes}]->(v0)
RETURN count(*)
"""
data = run_query(q)
print('Number of answers: {:,}'.format(data.get_data()[0][0]))

Launched job 11
Number of answers: 9,701
CPU times: user 8.79 ms, sys: 4.64 ms, total: 13.4 ms
Wall time: 4.99 s


### Resulting RDPflow

The result of combining these two "edge-create" queries is the `RDPflow` edge frame containing only "forward" RDP edges.
This alternate edge frame holding only RDP edges can be used instead of the generic
`Netflow` edge frame where an RDP edge is required in a query.

In [14]:
data=None
if rdpflow.num_edges == 0:
    print("RDPflow is empty")
elif rdpflow.num_edges <= 1000:
    data = rdpflow.get_data_pandas()
else:
    data = 'RDPflow (edges): {:,}'.format(rdpflow.num_edges)
data

'RDPflow (edges): 9,732'
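
As a quick sanity check, the edge count should equal the sum of the counts reported by the forward and reverse queries. The numbers below are copied from the outputs above.

```python
forward_edges = 31      # "Number of answers" from the forward query
reverse_edges = 9_701   # "Number of answers" from the reverse query
rdpflow_edges = 9_732   # RDPflow edge count reported above

total = forward_edges + reverse_edges
print(total == rdpflow_edges)
```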

In [15]:
# Utility to print the data sizes currently in xGT
def print_netflow_data_summary():
  print_data_summary()
  print('RDPflow (edges): {:,}'.format(rdpflow.num_edges))

print_netflow_data_summary()

Devices (vertices): 137,705
Netflow (edges): 235,661,328
Host event 1-vertex (edges): 18,637,483
Host event 2-vertex (edges): 47,790,045
Total (edges): 302,088,856
RDPflow (edges): 9,732


### Building a better query: adding temporal constraints 

Being more specific about what you're looking for is a good way to both improve performance and cut down on false positives in your results.
In our example, there is a causal dependence between the attacker's steps, which means that they must be temporally ordered.
So if *t<sub>1</sub>* represents the time at which event 1 takes place, we know that:

*t<sub>1</sub>* &le; *t<sub>2</sub>* &le; *t<sub>3</sub>* &le; *t<sub>4</sub>* &le; *t<sub>5</sub>* &le; *t<sub>6</sub>*

In addition, since this pattern models intentional lateral movement, we suspect that some of these events will be close together in time.
We can narrow the results by setting maximum time thresholds between specific groups of events:

 - The *hijack threshold* is the maximum time between an RDP hijack (`tscon.exe`) and the subsequent RDP netflow.
 - The *one_step threshold* is the maximum time from the initial *privilege escalation* event to the RDP netflow.
 - The *between_step threshold* is the maximum time allowed between steps (e.g., the time between RDP1 and RDP2).

Given some fixed constants for these thresholds, we can impose the following additional constraints:

 - *t<sub>3</sub>* - *t<sub>2</sub>* &le; *hijack threshold*
 - *t<sub>3</sub>* - *t<sub>1</sub>* &le; *one_step threshold*
 - *t<sub>6</sub>* - *t<sub>5</sub>* &le; *hijack threshold*
 - *t<sub>6</sub>* - *t<sub>4</sub>* &le; *one_step threshold*
 - *t<sub>6</sub>* - *t<sub>3</sub>* &le; *between_step threshold*

We will add all of these constraints to our query to help focus on just the results we want.
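
The constraints can also be written as a small predicate over the six event times, which is handy for checking candidate chains by hand. This is a sketch rather than part of the xGT query: the function name is hypothetical, the default thresholds are the values used in the query cell below, and the *between_step* check is applied to the two RDP netflow events (*t<sub>3</sub>* and *t<sub>6</sub>*), following the definition of that threshold.

```python
def satisfies_time_constraints(t, hijack=180, one_step=480, between_step=3600):
    """Check a chain of six event times (t1..t6, in epoch seconds)."""
    t1, t2, t3, t4, t5, t6 = t
    in_order = t1 <= t2 <= t3 <= t4 <= t5 <= t6
    return (in_order
            and t3 - t2 <= hijack          # hijack on A to RDP flow A -> B
            and t3 - t1 <= one_step        # privilege escalation on A to RDP flow
            and t6 - t5 <= hijack          # hijack on B to RDP flow B -> C
            and t6 - t4 <= one_step        # privilege escalation on B to RDP flow
            and t6 - t3 <= between_step)   # first RDP flow to second RDP flow

plausible = (1000, 1200, 1300, 1500, 1700, 1800)
too_slow = (1000, 1200, 1300, 1500, 1700, 9000)
print(satisfies_time_constraints(plausible))
print(satisfies_time_constraints(too_slow))
```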

### Lateral Movement query

This query leverages the new `RDPflow` edge frame (and data) to find the proper RDP edges for steps #3 and #6.

In [16]:
%%time
time_threshold_between_step = 3600   # one hour
time_threshold_hijack = 180          # three minutes
time_threshold_one_step = 480        # eight minutes
q = """
MATCH (A)-[rdp1:RDPflow]->(B)-[rdp2:RDPflow]->(C),
      (A)-[hijack1:Events1v]->(A)-[privEsc1:Events1v]->(A),
      (B)-[hijack2:Events1v]->(B)-[privEsc2:Events1v]->(B)
WHERE A <> B AND B <> C AND A <> C 
  AND privEsc1.eventID = 4688 
  AND (privEsc1.processName = "Proc336322.exe" OR privEsc1.processName = "Proc695356.exe")
  AND hijack1.eventID = 4688 AND hijack1.processName = "Proc249569.exe"
  AND privEsc2.eventID = 4688 
  AND (privEsc2.processName = "Proc336322.exe" OR privEsc2.processName = "Proc695356.exe")
  AND hijack2.eventID = 4688 AND hijack2.processName = "Proc249569.exe"
  // Check time constraints on the overall pattern
  AND rdp1.epochtime <= rdp2.epochtime
  AND rdp2.epochtime - rdp1.epochtime < {0}
  // Check time constraints on step from A to B
  AND privEsc1.epochtime <= hijack1.epochtime
  AND hijack1.epochtime <= rdp1.epochtime
  AND rdp1.epochtime - hijack1.epochtime < {1}
  AND rdp1.epochtime - privEsc1.epochtime < {2}
  // Check time constraints on step from B to C
  AND privEsc2.epochtime <= hijack2.epochtime
  AND hijack2.epochtime <= rdp2.epochtime
  AND rdp2.epochtime - hijack2.epochtime < {1}
  AND rdp2.epochtime - privEsc2.epochtime < {2}
RETURN rdp1.srcDevice, rdp1.dstDevice, rdp1.epochtime, rdp2.dstDevice, rdp2.epochtime
""".format(time_threshold_between_step, time_threshold_hijack, time_threshold_one_step)
answer_table = run_query(q)
print('Number of answers: {:,}'.format(answer_table.num_rows))

Launched job 14
Number of answers: 10,572
CPU times: user 6.62 ms, sys: 3.42 ms, total: 10 ms
Wall time: 1.88 s


In [17]:
# retrieve the answer rows to the client in a pandas frame
data = answer_table.get_data_pandas()
data

| | rdp1_srcDevice | rdp1_dstDevice | rdp1_epochtime | rdp2_dstDevice | rdp2_epochtime |
|---|---|---|---|---|---|
| 0 | ActiveDirectory | EnterpriseAppServer | 7290438 | Comp073202 | 7290972 |
| 1 | ActiveDirectory | EnterpriseAppServer | 7290438 | Comp073202 | 7290972 |
| 2 | ActiveDirectory | EnterpriseAppServer | 7290438 | Comp073202 | 7290972 |
| 3 | ActiveDirectory | EnterpriseAppServer | 7290438 | Comp073202 | 7290972 |
| 4 | ActiveDirectory | EnterpriseAppServer | 7290438 | Comp073202 | 7290972 |
| 5 | ActiveDirectory | EnterpriseAppServer | 7290438 | Comp073202 | 7290972 |
| 6 | ActiveDirectory | EnterpriseAppServer | 7290438 | Comp073202 | 7290972 |
| 7 | ActiveDirectory | EnterpriseAppServer | 7290438 | Comp073202 | 7290972 |
| 8 | ActiveDirectory | EnterpriseAppServer | 7290438 | Comp073202 | 7290972 |
| 9 | ActiveDirectory | EnterpriseAppServer | 7290438 | Comp073202 | 7290972 |
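
The rows shown above are identical, presumably because the query returns one row for every combination of matching event edges along the same path. A minimal sketch for collapsing such repeats on the client side, using only the standard library and the ten rows displayed above:

```python
from collections import Counter

# Rows copied from the ten results shown above, as tuples of
# (rdp1_srcDevice, rdp1_dstDevice, rdp1_epochtime, rdp2_dstDevice, rdp2_epochtime).
rows = [("ActiveDirectory", "EnterpriseAppServer", 7290438,
         "Comp073202", 7290972)] * 10

# Count how many times each distinct path occurs.
distinct_paths = Counter(rows)
for path, count in distinct_paths.items():
    print(count, path)
```

Applied to the full answer table, the same counting would show how many distinct device chains lie behind the 10,572 rows.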
