![ENSIGN logo](Hypergraph-Analytics.jpeg "ENSIGN logo")

# ENSIGN Cyber Workflow
* Reservoir Labs' AWS SageMaker Algorithm delivers powerful insights into computer networks.
* Using `conn` logs generated by Zeek, the ENSIGN Cyber workflow decomposes network traffic into a small set of distinct patterns.
* View these patterns in the `decomp.pdf` output.
* Read more detailed information on the patterns in the `report.txt` output.

## Understanding the results

ENSIGN creates insights by performing a tensor decomposition. This is an unsupervised machine learning method that breaks down a dataset into a configurable number of components. To perform the decomposition we need to represent Zeek Conn logs as tensors.

A tensor is a multidimensional array. The number of dimensions of a tensor is commonly known as the order, while each dimension is called a mode. A tensor of order one is a vector; a tensor of order two is commonly known as a matrix. In the Cyber Workflow, we create an order-four tensor whose modes are the datetime, source IP, destination IP, and port.

A tuple of indices identifies an element (also known as an entry or value) in the tensor. The elements of a mode are mapped to the indices for that mode in the tensor. The number of indices for a given mode is often referred to as the mode size. The indices run from 0 to the size of the mode minus one. The data values corresponding to tensor indices are known as labels. For example, imagine the entry at indices (0, 0, 0, 0) had the value three. The labels of those indices could be ("2021-10-31 12:52:00", "192.168.9.99", "192.168.12.12", "22"). In plain English, this means that the IP 192.168.9.99 connected to the IP 192.168.12.12 on port 22 on October 31st, 2021 at 12:52:00. The value three at that place in the tensor means that there were three such flows in the input data.

Modes are sorted to ease interpretability, so that larger-valued indices correspond to labels later in the sorted order. For example, 0 -> 2021-01-01 00:00:00, 1 -> 2021-01-01 00:00:01, and so on.
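The mapping from labels to indices and from flows to counts can be illustrated with a minimal sketch. It is not ENSIGN's implementation: the `records` list and the `build_sparse_tensor` helper are hypothetical, and labels are sorted as plain strings for simplicity (a real workflow would likely sort IPs and ports numerically).

```python
from collections import Counter

# Hypothetical conn-log records: (timestamp, source IP, destination IP, port).
records = [
    ("2021-10-31 12:52:00", "192.168.9.99", "192.168.12.12", "22"),
    ("2021-10-31 12:52:00", "192.168.9.99", "192.168.12.12", "22"),
    ("2021-10-31 12:52:00", "192.168.9.99", "192.168.12.12", "22"),
    ("2021-10-31 12:53:00", "192.168.9.99", "192.168.12.13", "443"),
]

def build_sparse_tensor(records):
    """Return (label_maps, entries) for a list of equal-length label tuples."""
    num_modes = len(records[0])
    # Sort the unique labels of each mode so larger indices come later in sorted order.
    label_maps = []
    for mode in range(num_modes):
        labels = sorted({rec[mode] for rec in records})
        label_maps.append({label: idx for idx, label in enumerate(labels)})
    # Count the records that land on each index tuple; absent tuples are implicit zeros.
    entries = Counter(
        tuple(label_maps[m][rec[m]] for m in range(num_modes)) for rec in records
    )
    return label_maps, entries

label_maps, entries = build_sparse_tensor(records)
print(entries[(0, 0, 0, 0)])  # 3: three identical flows share this index tuple
```

Storing only the non-zero index tuples, as the `Counter` does here, is one simple way to avoid materializing the full four-dimensional array.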

We can apply a function to map tensor labels to bins, for example a function mapping each timestamp to a minute-sized bin. When a function is used to map the labels of a mode to bins, each bin is mapped to a single index for that mode in the tensor and each bin gets a unique label. For example, timestamps are mapped to labels with the format yyyy-MM-dd hh:mm:00. The mapping of binned labels to their mode indices is found in the `map_mode_x.txt` files. The label on line 0 has index 0, the label on line 12 has index 12, and so on.
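As a rough illustration of minute-sized binning and of the line-number-to-index convention, consider the sketch below. The `minute_bin` and `index_labels` helpers are hypothetical, and the sketch assumes a map file holds exactly one label per line; check the actual `map_mode_x.txt` files before relying on that layout.

```python
from datetime import datetime

def minute_bin(timestamp: str) -> str:
    """Map a 'yyyy-MM-dd hh:mm:ss' timestamp to its minute-sized bin label."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return parsed.replace(second=0).strftime("%Y-%m-%d %H:%M:%S")

def index_labels(lines):
    """Pair each label with its zero-based line position (assumed file layout)."""
    return {idx: line.strip() for idx, line in enumerate(lines)}

# Hypothetical timestamps; two of them fall into the same minute bin.
timestamps = ["2021-10-31 12:52:07", "2021-10-31 12:52:41", "2021-10-31 12:53:02"]
bins = sorted({minute_bin(ts) for ts in timestamps})
index_to_label = index_labels(bins)
print(index_to_label)
```

With a real map file, the same `index_labels` helper could be applied to the open file object, since iterating a text file yields its lines in order.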

A tensor is termed sparse if the majority of its entries are zero. Each non-zero entry gives the count of flows matching its labels. Most real-world data is sparse, and our tensors representing Conn logs are a good example. If the tensor were fully dense (no zero entries), that would imply that in the time range represented by the input, there was at least one network flow for every combination of source IP, destination IP, and port during every single minute. Network hosts do not connect to every other host on the network on every port every minute, so the tensors representing network data tend to be sparse. ENSIGN is optimized for calculations on sparse tensors, and alternative data structures and methods of storage are used to prevent data explosion when dealing with large mode sizes.
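A quick back-of-the-envelope calculation shows how sparse such a tensor can be. The mode sizes and non-zero count below are invented for illustration only, and `density` is a hypothetical helper.

```python
def density(nnz: int, mode_sizes) -> float:
    """Fraction of tensor entries that are non-zero."""
    total = 1
    for size in mode_sizes:
        total *= size
    return nnz / total

# Hypothetical day of traffic: 1440 minute bins, 500 source IPs,
# 500 destination IPs, 1000 ports, and two million distinct flows.
mode_sizes = (1440, 500, 500, 1000)
d = density(2_000_000, mode_sizes)
print(f"{d:.2e}")
```

Under these assumed sizes a dense array would need 3.6e11 cells, while only a few millionths of them would hold a non-zero count.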

A tensor that can be formed from the scaled outer product of vectors is said to have rank 1. The rank of a tensor is the smallest number of rank 1 tensors that would need to be summed to reconstruct that tensor.
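A small NumPy sketch can make the rank-1 definition concrete. The vectors and the `weight` value are arbitrary illustrative choices, and three modes are used instead of four to keep the example short.

```python
import numpy as np

# Arbitrary example vectors, one per mode of an order-three tensor.
a = np.array([1.0, 0.0])
b = np.array([0.5, 0.5, 0.0])
c = np.array([0.0, 1.0])
weight = 2.0

# Scaled outer product: rank_one[i, j, k] = weight * a[i] * b[j] * c[k].
rank_one = weight * np.einsum("i,j,k->ijk", a, b, c)
print(rank_one.shape)  # (2, 3, 2)
```

Every entry of `rank_one` is fully determined by one value from each vector, which is what makes a rank-1 tensor so compact to store.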

A tensor decomposition breaks apart a tensor into R components, where R is a positive integer known as the rank of the decomposition. If the rank is not specified as a hyperparameter, it is chosen to be the integer closest to ln(nnz), where nnz is the number of non-zero entries in the tensor. Each component comprises four vectors (one each for the Datetime, Source IP, Dest. IP, and Port modes) and a scalar weight. The size of each vector is equal to the size of the corresponding mode in the tensor. The entries of a vector are referred to as scores and are real values in the range -1 to 1 (or 0 to 1 for non-negative decompositions). Alternatively, scores can be organized as a set of four factor matrices. The columns of a factor matrix are the component vectors along that mode, so each factor matrix has R columns, where R is the rank of the decomposition. Factor matrices are found in the `decomp_mode_x.txt` files.
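The default rank rule can be written out in a couple of lines. This is only a sketch of the rule as stated above: `default_rank` is a hypothetical name, and the floor of 1 is an added assumption to keep the result a positive integer for very small inputs.

```python
import math

def default_rank(nnz: int) -> int:
    """Integer closest to ln(nnz), floored at 1 (the floor is an assumption)."""
    return max(1, round(math.log(nnz)))

print(default_rank(1_000_000))  # 14, since ln(1,000,000) is about 13.8
```

Because the rule grows logarithmically, even a hundredfold increase in non-zero entries adds fewer than five components.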

We can rebuild an approximation of the original tensor from its components through reconstruction. For each component, we take the outer product of its set of vectors and scale it by the component’s weight. The order of this outer product equals the number of modes in the original tensor. We then sum the R order-four rank-1 tensors pointwise. Intuitively, each component captures a cluster of non-zero entries in the original tensor that are related by correlations of indices across those entries. In practice, components tend to capture specific behaviors in the originating data.
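The reconstruction step can be sketched with NumPy as follows. This is an illustration on a tiny dense example, not ENSIGN's code: `reconstruct` is a hypothetical helper, the factor matrices are random placeholders, and each is assumed to have one row per mode index and one column per component.

```python
import numpy as np

def reconstruct(weights, factors):
    """Sum the weighted outer products of the components of a 4-mode decomposition."""
    time_f, src_f, dst_f, port_f = factors
    # For each component r, multiply one score from every mode and the weight,
    # then sum over r to obtain the approximated tensor.
    return np.einsum("r,ir,jr,kr,lr->ijkl", weights, time_f, src_f, dst_f, port_f)

# Placeholder decomposition: rank 2, mode sizes (3, 2, 2, 4).
rng = np.random.default_rng(0)
sizes, rank = (3, 2, 2, 4), 2
weights = np.array([5.0, 2.0])
factors = [rng.random((n, rank)) for n in sizes]

approx = reconstruct(weights, factors)
print(approx.shape)  # (3, 2, 2, 4)
```

A dense reconstruction like this is only practical for small mode sizes; for realistic network data one would evaluate the sum at selected index tuples instead.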

A number of metrics about the decomposition procedure are recorded in the metrics.txt file. Among them is the "fit", which is how close the approximated tensor is to the original tensor. Additionally, the "coverage" is the percentage of the original dataset that is captured in the decomposition.
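One common way to quantify fit in the tensor decomposition literature is one minus the relative error in the Frobenius norm. The sketch below uses that definition as an assumption; the value recorded in `metrics.txt` may be computed differently, and the `fit` helper is hypothetical.

```python
import numpy as np

def fit(original, approximation):
    """1 - ||X - X_hat|| / ||X||, an assumed definition of fit."""
    residual = np.linalg.norm(original - approximation)
    return 1.0 - residual / np.linalg.norm(original)

# Tiny dense example tensor with three non-zero counts.
x = np.zeros((2, 2, 2, 2))
x[0, 0, 0, 0] = 3.0
x[1, 0, 1, 1] = 1.0
x[1, 1, 0, 1] = 2.0

print(fit(x, x))                 # 1.0: a perfect reconstruction
print(fit(x, np.zeros_like(x)))  # 0.0: an all-zero approximation captures nothing
```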

### Component visualization
An example of a pattern (component) is shown below.
* `Component 13` Identifier of the component. Components are ordered by their weight.
* `Weight 24060.53` The weight of the component. This roughly corresponds to how many log lines in the original input data are represented by this pattern.
    * The pattern is defined by the flows (timestamp, source IP, dest. IP, port) that are scored non-zero in the component.
* Four subplots are also present: `timestamp`, `src.ip`, `dst.ip`, and `dst.port`. These represent the various aspects (modes) of the component. The x-axis enumerates the labels in the mode (timestamps, IPs, or ports), while the y-axis gives the score (0-1) of each label, indicating the presence or absence of that entity in the pattern.
* To the right of the subplots are the top scoring labels of each mode. The three columns are the index of the label, the label itself, and the score, respectively.

![Portscan Pattern](portscan_component.png "Portscan Pattern")

* The scoring of the timestamp mode indicates that the behavior this component represents is periodic.
* Source and destination IP modes have only one nonzero label each, indicating that only a single source IP and a single destination IP are involved in the behavior.
* Many ports are scored with roughly equal weight in the last mode.

#### The combination of these traits is commonly associated with port-scanning behavior.

### Textual Report

![Report Sample](report_sample.png "Report Sample")

The report is another way to understand the results of the ENSIGN Cyber workflow. 
* The heading contains metadata about the decomposition including the number of entries (nonzeros) from the input logs and the sizes of the modes (number of unique labels). The rank indicates how many components are in the decomposition.
* Similarly to the visualization, the report contains a subsection for each component and a further subsection for each mode within.
* The top labels of each mode are shown as in the visualization along with the output from our custom post-processes.

### Cyber post-processing

We believe much can be done with the components of tensor decompositions. Here we include a few examples of post-processing routines that we use with our network data. The results of these post-processes are also included in the Cyber Workflow.

#### Port-scan detector (seen above)
* Categorizes a component as containing potential port-scan behavior when many ports are scored significantly
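To give a feel for what "many ports are scored significantly" could mean in practice, here is a minimal sketch. It is not the detector shipped with the workflow: the helper names, the relative threshold of 0.5, and the minimum of 20 ports are all hypothetical choices.

```python
import numpy as np

def count_significant(scores, rel_threshold=0.5):
    """Count labels whose score is at least rel_threshold times the top score."""
    scores = np.abs(np.asarray(scores, dtype=float))
    if scores.size == 0 or scores.max() == 0:
        return 0
    return int(np.count_nonzero(scores >= rel_threshold * scores.max()))

def looks_like_port_scan(port_scores, min_ports=20, rel_threshold=0.5):
    """Flag a component when many ports carry comparable scores (assumed rule)."""
    return count_significant(port_scores, rel_threshold) >= min_ports

# Hypothetical port-mode vectors: many equally scored ports vs. a single service.
scan_like = np.full(200, 0.07)
single_service = np.zeros(200)
single_service[22] = 1.0

print(looks_like_port_scan(scan_like))       # True
print(looks_like_port_scan(single_service))  # False
```

Applying the same counting idea to the destination IP mode instead of the port mode would give a comparable sketch for network mapping.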

#### Network-mapping detector
* Categorizes a component as containing potential network-mapping behavior when many destination IPs are scored significantly
    
Below are samples of a network-mapping component as displayed in the report and the visualization:

![Network-mapping report sample](netmap_report_sample.png "Network-mapping report sample")

![Network-mapping visual sample](netmap_viz.png "Network-mapping visual sample")

#### Beaconing detector
* Categorizes a component as containing potential beaconing behavior when the time mode displays a periodic pattern
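One simple way to test a time mode for periodicity is to check whether the gaps between its significantly scored time bins are nearly constant. The sketch below takes that approach as an assumption; it is not the workflow's detector, and `looks_periodic` and all of its thresholds are hypothetical.

```python
import numpy as np

def looks_periodic(time_scores, rel_threshold=0.5, max_jitter=0.1, min_events=4):
    """Flag a time-mode vector whose active bins are evenly spaced (assumed rule)."""
    scores = np.abs(np.asarray(time_scores, dtype=float))
    if scores.size == 0 or scores.max() == 0:
        return False
    # Indices of time bins scored at least rel_threshold times the top score.
    active = np.flatnonzero(scores >= rel_threshold * scores.max())
    if active.size < min_events:
        return False
    gaps = np.diff(active)
    # Require real gaps (not continuous activity) with little variation.
    return bool(gaps.mean() > 1 and gaps.std() <= max_jitter * gaps.mean())

# Hypothetical time modes over 60 minute bins.
beacon = np.zeros(60)
beacon[[5, 15, 25, 35, 45, 55]] = 1.0   # one event every 10 minutes
bursty = np.zeros(60)
bursty[[1, 2, 3, 40, 41, 59]] = 1.0     # irregular bursts

print(looks_periodic(beacon))  # True
print(looks_periodic(bursty))  # False
```

A spectral or autocorrelation-based test would be more robust to missed beacons; the gap check is used here only because it is easy to follow.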

![Beaconing report sample](beaconing_report.png "Beaconing report sample")

![Beaconing visual sample](beaconing_viz.png "Beaconing visual sample")

### Conclusion

The ENSIGN Cyber Workflow can provide unique insights into your network and can be integrated with other cybersecurity practices. When it is run as a daily workflow, certain components will show up frequently and can be used to establish a baseline of network behavior. Previously unseen components can reveal new types of behavior. You can further adapt the ENSIGN Cyber Workflow to your needs or automate the extraction of insights with custom post-processes. To learn more about ENSIGN, visit www.reservoir.com/ensign.