# Insider Exfiltration

----

We are looking for an insider exfiltration graph pattern in a large data graph built from the [LANL Unified Host and Network Dataset](https://datasets.trovares.com/cyber/LANL/index.html), a set of netflow and host event data collected on an internal Los Alamos National Lab network.

The LANL dataset consists of:

- Netflow data (aggregated and sessionized)
- Windows Logging Events - 1: events that involve exactly one device, such as *reboot*
- Windows Logging Events - 2: events that involve exactly two devices, such as *failed authentication attempt from device A to device B*
 

## Motivation for graph pattern


This notebook shows one kind of *graph pattern search* following the recommendations of the [Common Sense Guide to Mitigating Insider Threats, Fifth Edition](https://resources.sei.cmu.edu/asset_files/TechnicalReport/2016_005_001_484758.pdf).

This graph pattern is motivated by the following scenario:

- An employee is working with a competitor to exfiltrate sensitive data
- They do this by logging in to multiple systems within the enterprise that hold the sensitive data
- From each sensitive data store, they launch a program that sends data out to a common exfiltration target

This pattern is shown here:

<img src="images/insider-xfil.png" alt="Insider Exfiltration" />

where:

- the red edges from A to B, C, and D are all *successful authentication* (login) events
- the purple edges that are self-loops are all *program start* events
- the black edges from B, C, and D to E are all netflow records with high byte counts going to the same destination port at device E.
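
To make the shape of this pattern concrete, here is a minimal pure-Python sketch that checks it by brute force on a handful of toy records. The device names, the record layout, and the `find_exfil_candidates` helper are all hypothetical and invented for illustration; the sketch ignores timing and byte-count thresholds, and it is not how xGT evaluates the query.

```python
from collections import defaultdict

# Hypothetical toy records (invented for illustration, not LANL data).
logins = [("A", "B"), ("A", "C"), ("A", "D"), ("X", "B")]   # (source, target) successful logins
program_starts = {"B", "C", "D"}                              # devices with a program-start self-loop
flows = [("B", "E", 443), ("C", "E", 443),                    # (source, target, dstPort),
         ("D", "E", 443), ("B", "F", 80)]                     # assumed to be high-volume flows

def find_exfil_candidates(logins, program_starts, flows, min_hosts=3):
    """Return (insider, target, port, hosts) tuples where one device logged in
    to at least min_hosts hosts that each started a program and sent data to
    the same target on the same destination port."""
    logged_into = defaultdict(set)            # insider -> hosts it logged in to
    for src, dst in logins:
        if src != dst:
            logged_into[src].add(dst)
    senders = defaultdict(set)                # (target, port) -> sending hosts
    for src, dst, port in flows:
        if src in program_starts and src != dst:
            senders[(dst, port)].add(src)
    results = []
    for insider, hosts in logged_into.items():
        for (target, port), sending in senders.items():
            common = (hosts & sending) - {insider, target}
            if insider != target and len(common) >= min_hosts:
                results.append((insider, target, port, sorted(common)))
    return results

candidates = find_exfil_candidates(logins, program_starts, flows)
print(candidates)
```

On the toy data, only device A reaches three hosts that all send to E on port 443, so it is the only candidate reported.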



----
## Using xGT to perform this search

The rest of this notebook demonstrates how to take the LANL data and the search pattern description and perform these steps:
  1. Ingest the cyber data into xGT.
  2. Search for all occurrences of this pattern.

In [1]:
import xgt
conn = xgt.Connection()
conn

<xgt.connection.Connection at 0x102f04ed0>

## Establish Graph Component Schemas

We first try to retrieve the graph component schemas from the xGT server.
If a lookup fails, we create an empty frame (vertex or edge) for the missing component.

In [2]:
try:
  devices = conn.get_vertex_frame('Devices')
except xgt.XgtNameError:
  devices = conn.create_vertex_frame(
              name='Devices',
              schema=[['device', xgt.TEXT]],
              key='device')
devices

<xgt.graph.VertexFrame at 0x103b11f50>

In [3]:
try:
  netflow = conn.get_edge_frame('Netflow')
except xgt.XgtNameError:
  netflow = conn.create_edge_frame(
            name='Netflow',
            schema=[['epochtime', xgt.INT],
                    ['duration', xgt.INT],
                    ['srcDevice', xgt.TEXT],
                    ['dstDevice', xgt.TEXT],
                    ['protocol', xgt.INT],
                    ['srcPort', xgt.INT],
                    ['dstPort', xgt.INT],
                    ['srcPackets', xgt.INT],
                    ['dstPackets', xgt.INT],
                    ['srcBytes', xgt.INT],
                    ['dstBytes', xgt.INT]],
            source=devices,
            target=devices,
            source_key='srcDevice',
            target_key='dstDevice')
netflow

<xgt.graph.EdgeFrame at 0x103b1cc90>

**Edges:** The LANL dataset contains two types of data: netflow and host events. Of the host events recorded, some describe events within a device (e.g., reboots), and some describe events between devices (e.g., login attempts). We load the netflow data and both kinds of host events. We call the in-device events "one-sided" (`Events1v`), since we describe them as graph edges from one vertex to itself; events between two devices are "two-sided" (`Events2v`).

In [4]:
try:
  events1v = conn.get_edge_frame('Events1v')
except xgt.XgtNameError:
  events1v = conn.create_edge_frame(
           name='Events1v',
           schema=[['epochtime', xgt.INT],
                   ['eventID', xgt.INT],
                   ['logHost', xgt.TEXT],
                   ['userName', xgt.TEXT],
                   ['domainName', xgt.TEXT],
                   ['logonID', xgt.INT],
                   ['processName', xgt.TEXT],
                   ['processID', xgt.INT],
                   ['parentProcessName', xgt.TEXT],
                   ['parentProcessID', xgt.INT]],
           source=devices,
           target=devices,
           source_key='logHost',
           target_key='logHost')
events1v

<xgt.graph.EdgeFrame at 0x103b1ccd0>

In [5]:
try:
  events2v = conn.get_edge_frame('Events2v')
except xgt.XgtNameError:
  events2v = conn.create_edge_frame(
           name='Events2v',
           schema = [['epochtime',xgt.INT],
                     ['eventID',xgt.INT],
                     ['logHost',xgt.TEXT],
                     ['logonType',xgt.INT],
                     ['logonTypeDescription',xgt.TEXT],
                     ['userName',xgt.TEXT],
                     ['domainName',xgt.TEXT],
                     ['logonID',xgt.INT],
                     ['subjectUserName',xgt.TEXT],
                     ['subjectDomainName',xgt.TEXT],
                     ['subjectLogonID',xgt.TEXT],
                     ['status',xgt.TEXT],
                     ['src',xgt.TEXT],
                     ['serviceName',xgt.TEXT],
                     ['destination',xgt.TEXT],
                     ['authenticationPackage',xgt.TEXT],
                     ['failureReason',xgt.TEXT],
                     ['processName',xgt.TEXT],
                     ['processID',xgt.INT],
                     ['parentProcessName',xgt.TEXT],
                     ['parentProcessID',xgt.INT]],
            source = 'Devices',
            target = 'Devices',
            source_key = 'src',
            target_key = 'logHost')
events2v

<xgt.graph.EdgeFrame at 0x103b11d10>
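
The same try/except shape repeats in each of the cells above. As a sketch only, it could be factored into a small generic helper. The `get_or_create` function and the `FakeConnection` stand-in below are hypothetical names introduced so the example runs without an xGT server; they are not part of the xGT API.

```python
def get_or_create(getter, creator, not_found, name, **create_kwargs):
    """Return getter(name); if that raises not_found, call creator(name=name, ...)."""
    try:
        return getter(name)
    except not_found:
        return creator(name=name, **create_kwargs)

# Hypothetical stand-in for a server connection (not the xGT API).
class FakeConnection:
    def __init__(self):
        self.frames = {}

    def get_frame(self, name):
        return self.frames[name]              # raises KeyError when missing

    def create_frame(self, name, schema):
        self.frames[name] = {"name": name, "schema": schema}
        return self.frames[name]

fake = FakeConnection()
# The first call creates the frame; the second call finds the existing one.
first = get_or_create(fake.get_frame, fake.create_frame, KeyError,
                      "Devices", schema=[["device", "text"]])
second = get_or_create(fake.get_frame, fake.create_frame, KeyError,
                       "Devices", schema=[["device", "text"]])
print(first is second)
```

With the notebook's real connection, the equivalent call for the vertex frame would presumably be `get_or_create(conn.get_vertex_frame, conn.create_vertex_frame, xgt.XgtNameError, 'Devices', schema=[['device', xgt.TEXT]], key='device')`, using only the calls already shown above.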

In [6]:
# Utility to print the sizes of data currently in xGT
def print_data_summary():
  print('Devices (vertices): {:,}'.format(devices.num_vertices))
  print('Netflow (edges): {:,}'.format(netflow.num_edges))
  print('Host event 1-vertex (edges): {:,}'.format(events1v.num_edges))
  print('Host event 2-vertex (edges): {:,}'.format(events2v.num_edges))
  print('Total (edges): {:,}'.format(
      netflow.num_edges + events1v.num_edges + events2v.num_edges))
    
print_data_summary()

Devices (vertices): 159,245
Netflow (edges): 317,164,045
Host event 1-vertex (edges): 33,480,483
Host event 2-vertex (edges): 97,716,529
Total (edges): 448,361,057


## Load the data

If you are already connected to an xGT server with data loaded, you may skip this section
and go straight to the "**Utility python functions for interacting with xGT**" section.

**Load the 1-sided host event data:**

In [None]:
%%time
if events1v.num_edges == 0:
    urls = ["https://datasets.trovares.com/LANL/xgt/wls_day-85_1v.csv"]
    # urls = ["xgtd://wls_day-{:02d}_1v.csv".format(_) for _ in range(2,91)]
    events1v.load(urls)
    print_data_summary()

**Load the 2-sided host event data:**

In [None]:
%%time
if events2v.num_edges == 0:
    urls = ["https://datasets.trovares.com/LANL/xgt/wls_day-85_2v.csv"]
    # urls = ["xgtd://wls_day-{:02d}_2v.csv".format(_) for _ in range(2,91)]
    events2v.load(urls)
    print_data_summary()

**Load the netflow data:**

In [None]:
%%time
if netflow.num_edges == 0:
    urls = ["https://datasets.trovares.com/LANL/xgt/nf_day-85.csv"]
    # urls = ["xgtd://nf_day-{:02d}.csv".format(_) for _ in range(2,91)]
    netflow.load(urls)
    print_data_summary()
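
The commented-out lines in the load cells build one URL per day with a zero-padded day number. A small sketch of what that expansion produces is shown here; the `day_urls` helper is a hypothetical name, and the day range 2 through 90 is taken from the `range(2,91)` in those comments.

```python
def day_urls(template, first_day=2, last_day=90):
    """Expand a template such as 'xgtd://nf_day-{:02d}.csv' into one URL per day."""
    return [template.format(day) for day in range(first_day, last_day + 1)]

urls = day_urls("xgtd://nf_day-{:02d}.csv")
print(len(urls))            # number of daily files
print(urls[0], urls[-1])    # first and last URL in the list
```

The `{:02d}` format pads single-digit days with a leading zero, so day 2 becomes `nf_day-02.csv`.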

## Utility python functions for interacting with xGT

----

Now we define a helper function for running queries and get on with the querying.

In [7]:
# Utility function to launch queries and show job number:
#   The job number may be useful if a long-running job needs
#   to be canceled.

import time
def run_query(query, table_name = "answers", drop_answer_table=True, show_query=False):
    if drop_answer_table:
        conn.drop_frame(table_name)
    if query[-1] != '\n':
        query += '\n'
    query += 'INTO {}'.format(table_name)
    if show_query:
        print("Query:\n" + query)
    job = conn.schedule_job(query)
    print("Launched job {} at time: {}".format(job.id, time.asctime()))
    conn.wait_for_job(job)
    table = conn.get_table_frame(table_name)
    return table
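
The string handling inside `run_query` can be looked at in isolation. The sketch below reproduces only the step that appends the `INTO` clause; `with_into_clause` is a hypothetical name, and it uses `str.endswith` rather than indexing `query[-1]` so that an empty string does not raise an error.

```python
def with_into_clause(query, table_name="answers"):
    """Append 'INTO <table_name>' on its own line, as run_query does."""
    if not query.endswith("\n"):
        query += "\n"
    return query + "INTO {}".format(table_name)

example = with_into_clause("MATCH (a)-[e:Netflow]->(b)\nRETURN count(*)")
print(example)
```

Because the clause is appended after the `RETURN` line, the queries in the following cells are written without an `INTO` of their own.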

## Looking for one path

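This query looks for only one path from A to E (through B). Windows event ID 4624 marks a successful logon and event ID 4688 marks the creation of a new process.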


In [8]:
%%time
q = """
MATCH
  (E)<-[nf1:Netflow]-(B)<-[login1:Events2v]-(A), (B)<-[prog1:Events1v]-(B)
WHERE A <> B AND B <> E AND A <> E
  AND login1.eventID = 4624
  AND prog1.eventID = 4688
  AND nf1.dstBytes > 100000000
  // time constraints within each path
  AND login1.epochtime < prog1.epochtime
  AND prog1.epochtime < nf1.epochtime
  AND nf1.epochtime - login1.epochtime <= 30
RETURN count(*)
"""
data = run_query(q)
print('Number of answers: {:,}'.format(data.get_data()[0][0]))

Launched job 7
Number of answers: 2,992,551
CPU times: user 25.9 ms, sys: 11.7 ms, total: 37.6 ms
Wall time: 21.8 s
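
The time constraints in the query above require the login, the program start, and the netflow record to occur in that order, all within 30 seconds. As an illustration only, the same test can be written as a plain Python predicate; `path_is_timely` and the sample timestamps are hypothetical.

```python
def path_is_timely(login_time, program_time, netflow_time, max_span=30):
    """True when login < program start < netflow, and the netflow record
    begins no more than max_span seconds after the login."""
    in_order = login_time < program_time < netflow_time
    return in_order and (netflow_time - login_time) <= max_span

print(path_is_timely(100, 110, 125))   # ordered, and 25 seconds apart
print(path_is_timely(100, 110, 140))   # ordered, but 40 seconds apart
print(path_is_timely(100, 95, 120))    # program start precedes the login
```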


## Looking for three paths

This query looks for at least three paths from A to E (through B, C, and D).

<img src="images/insider-xfil.png" alt="Insider Exfiltration" />

In [9]:
%%time
q = """
MATCH
  (E)<-[nf1:Netflow]-(B)<-[login1:Events2v]-(A), (B)<-[prog1:Events1v]-(B),
  (E)<-[nf2:Netflow]-(C)<-[login2:Events2v]-(A), (C)<-[prog2:Events1v]-(C),
  (E)<-[nf3:Netflow]-(D)<-[login3:Events2v]-(A), (D)<-[prog3:Events1v]-(D)
WHERE A <> B AND A <> C AND A <> D AND A <> E AND B <> C AND B <> D AND B <> E
  AND C <> D AND C <> E AND D <> E
  AND login1.eventID = 4624 AND login2.eventID = 4624 AND login3.eventID = 4624 
  AND prog1.eventID = 4688 AND prog2.eventID = 4688 AND prog3.eventID = 4688
  AND nf1.dstBytes > 100000000 AND nf2.dstBytes > 100000000 AND nf3.dstBytes > 100000000
  // constraints across paths
  AND login1.epochtime < login2.epochtime
  AND login2.epochtime < login3.epochtime
  AND login3.epochtime - login1.epochtime < 3600
  AND nf1.dstPort = nf2.dstPort AND nf2.dstPort = nf3.dstPort
  AND prog1.processName = prog2.processName AND prog2.processName = prog3.processName
  // time constraints within each path
  AND login1.epochtime < prog1.epochtime
  AND prog1.epochtime < nf1.epochtime
  AND nf1.epochtime - login1.epochtime <= 30
  AND login2.epochtime < prog2.epochtime
  AND prog2.epochtime < nf2.epochtime
  AND nf2.epochtime - login2.epochtime <= 30
  AND login3.epochtime < prog3.epochtime
  AND prog3.epochtime < nf3.epochtime
  AND nf3.epochtime - login3.epochtime <= 30 
RETURN login1.epochtime as time1, login2.epochtime as time2,
  login3.epochtime as time3, login3.epochtime - login1.epochtime as interval,
  nf1.dstPort as dport1, nf2.dstPort as dport2, nf3.dstPort as dport3
"""
data = run_query(q)
print('Number of answers: {:,}'.format(data.num_rows))

Launched job 10
Number of answers: 1
CPU times: user 266 ms, sys: 156 ms, total: 422 ms
Wall time: 22min 12s
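
The pairwise inequality constraints at the top of the `WHERE` clause grow quadratically with the number of vertices, which makes them easy to mistype. One possible way to generate them, sketched here with a hypothetical `distinct_clause` helper, is `itertools.combinations`.

```python
from itertools import combinations

def distinct_clause(names):
    """Build 'X <> Y' terms for every unordered pair of names, joined by AND."""
    return " AND ".join("{} <> {}".format(a, b) for a, b in combinations(names, 2))

clause = distinct_clause(["A", "B", "C", "D", "E"])
print(clause)
```

For five vertices this yields ten terms, in the same order as they appear in the query above.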


In [10]:
pdata = data.get_data_pandas()
pdata

|   | time1 | time2 | time3 | interval | dport1 | dport2 | dport3 |
|---|-------|-------|-------|----------|--------|--------|--------|
| 0 | 134790 | 134952 | 137026 | 2236 | 443 | 443 | 443 |
