# ENSIGN Cyber Workflow
* Reservoir Labs' AWS Sagemaker Algorithm delivers powerful insights into computer networks.
* Using `conn` logs generated by Zeek, the ENSIGN Cyber workflow decomposes network traffic into a small set of distinct patterns.
* View these patterns in the `decomp.pdf` output.
* Read more detailed information on the patterns in the `report.txt` output.

## Running the workflow
* `job_name` An identifier for this instance of the workflow.
* `filepaths` A string or list of strings of filepaths pointing to conn logs to be used as input to the workflow. They must be relative to the `s3_input_path`.
* `s3_input_path` An S3 path pointing to a bucket containing the input logs.
* `s3_output_path` An S3 path pointing to a bucket where the output will be saved.
* `role_arn` The ARN of an AWS role that has permissions to run Sagemaker training jobs and access the specified S3 buckets.
* `region_name` The region where the Sagemaker training job will be scheduled.
* `instance_type` The Sagemaker instance type used to perform the computation. The instance should have enough memory to hold the dataset.
* `volume_size_gb` The size of the volume attached to the compute instance. Must be large enough to fit the dataset.
* `max_runtime_seconds` The maximum number of seconds to allow the computation to run. Larger datasets can take several hours to decompose.
* `rank` Optional. The number of patterns (components) to produce in the decomposition.
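
The requirement that `filepaths` be relative to `s3_input_path` is easy to get wrong when you start from full S3 URIs. The following is a minimal sketch of a helper that strips the input prefix and rejects anything outside it. The function name `to_relative_filepaths` is hypothetical and is not part of the `cyber_workflow` package.

```python
def to_relative_filepaths(s3_input_path, uris):
    """Convert full S3 URIs into paths relative to s3_input_path.

    Hypothetical helper, not part of cyber_workflow. Raises ValueError
    for any URI that does not live under s3_input_path.
    """
    # Normalize the prefix so "s3://bucket/logs" and "s3://bucket/logs/" behave the same.
    prefix = s3_input_path if s3_input_path.endswith("/") else s3_input_path + "/"
    relative = []
    for uri in uris:
        if not uri.startswith(prefix):
            raise ValueError(f"{uri} is not under {prefix}")
        relative.append(uri[len(prefix):])
    return relative
```

The returned list could then be passed as the `filepaths` argument, with the same `s3_input_path` used for both the helper and the workflow call.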

```python
from cyber_workflow import cyber_workflow

cyber_workflow(
    job_name="cyber-workflow",
    filepaths="conn.log",
    s3_input_path="s3://path/to/conn_example/",
    s3_output_path="s3://path/to/conn_example/",
    role_arn="ROLE_ARN",
    region_name="us-east-1",
    instance_type="ml.m5.large",
    volume_size_gb=1,
    max_runtime_seconds=3600,
    rank=10
)
```

```
[2021-10-12 20:29:33.124466] Starting
[2021-10-12 20:32:13.634214] Downloading
[2021-10-12 20:32:30.812942] Training
[2021-10-12 20:34:42.995000] 2021-10-13 00:34:39,295 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
[2021-10-12 20:34:42.996000] 2021-10-13 00:34:39,295 sagemaker-training-toolkit INFO     Failed to parse hyperparameter filepaths value conn.log to Json.
[2021-10-12 20:34:42.996000] Returning the value itself
[2021-10-12 20:34:42.996000] 2021-10-13 00:34:42,343 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
[2021-10-12 20:34:42.996000] 2021-10-13 00:34:42,343 sagemaker-training-toolkit INFO     Failed to parse hyperparameter filepaths value conn.log to Json.
[2021-10-12 20:34:42.996000] 2021-10-13 00:34:42,353 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
[2021-10-12 20:34:42.996000] 2021-10-13 00:34:42,354 sagemaker-training-toolkit INFO     Failed to parse hyperp

[2021-10-12 20:35:09.203221] Completed
{'TrainingJobArn': 'arn:aws:sagemaker:us-east-1:998216925257:training-job/cyber-workflow', 'ResponseMetadata': {'RequestId': '59ec4df6-ec08-4175-9d6e-c9a0782afdcf', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '59ec4df6-ec08-4175-9d6e-c9a0782afdcf', 'content-type': 'application/x-amz-json-1.1', 'content-length': '89', 'date': 'Wed, 13 Oct 2021 00:29:30 GMT'}, 'RetryAttempts': 0}}
```
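
Once the job reports `Completed`, the `decomp.pdf` and `report.txt` outputs need to be fetched from `s3_output_path`. The sketch below shows one way to do that with a boto3 S3 client. It rests on an assumption that is not confirmed by this document: that the two files are stored as individual objects somewhere under the output prefix. If the job packages its outputs into an archive, this sketch will not find them. The names `split_s3_uri` and `download_outputs` are hypothetical.

```python
import os
from urllib.parse import urlparse

# Output file names mentioned in this document.
OUTPUT_NAMES = ("decomp.pdf", "report.txt")


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix/' into ('bucket', 'some/prefix/')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def download_outputs(s3_client, s3_output_path, dest_dir="."):
    """Download every object under s3_output_path whose file name is a known output.

    s3_client is expected to behave like boto3.client("s3").
    ASSUMPTION: outputs are plain objects under the prefix (layout not documented here).
    If several jobs wrote to the same prefix, later matches overwrite earlier ones.
    """
    bucket, prefix = split_s3_uri(s3_output_path)
    downloaded = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            name = key.rsplit("/", 1)[-1]
            if name in OUTPUT_NAMES:
                local_path = os.path.join(dest_dir, name)
                s3_client.download_file(bucket, key, local_path)
                downloaded.append(local_path)
    return downloaded
```

With boto3 installed and credentials configured, a call might look like `download_outputs(boto3.client("s3"), "s3://path/to/conn_example/")`. Passing the client in as an argument keeps the helper easy to test without network access.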


## Understanding the results

### Component visualization
An example of a pattern (component) is shown below.
* `Component 13` The identifier of the component. Components are ordered by weight.
* `Weight 24060.53` The weight of the component. This roughly corresponds to how many log lines in the original input data are represented by this pattern.
* Four subplots are also present: `timestamp`, `src.ip`, `dst.ip`, and `dst.port`. These represent the aspects (modes) of the component. The x-axis enumerates the labels in the mode (timestamps, IPs, and ports, respectively), while the y-axis gives the score (0-1) of each label, indicating the presence or absence of that entity in the pattern.
* To the right of the subplots are the top-scoring labels of each mode. The three columns are the index of the label, the label itself, and the score, respectively.

![Portscan Pattern](portscan_component.png "Portscan Pattern")

* The scoring of the timestamp mode indicates that the behavior this component represents is periodic.
* Source and destination IP modes have only one nonzero label each, indicating that only a single source IP and a single destination IP are involved in the behavior.
* Many ports are scored with roughly equal weight in the last mode.

#### The combination of these traits is commonly associated with port-scanning behavior.
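
The reasoning above (one source IP, one destination IP, many ports with roughly equal scores) can be written down as a simple check. The following is only an illustrative sketch: the dictionary layout for a component, the function name, and the thresholds `min_ports` and `spread` are all assumptions chosen for the example, and the workflow itself does not emit components in this form. Periodicity in the timestamp mode is not checked here.

```python
def nonzero_scores(mode_scores, eps=1e-6):
    """Return the scores in a mode that are meaningfully above zero."""
    return [score for score in mode_scores.values() if score > eps]


def looks_like_port_scan(component, min_ports=100, spread=0.5):
    """Heuristic check for the port-scan traits described in the text.

    component: hypothetical dict mapping mode name -> {label: score in [0, 1]}.
    min_ports: assumed minimum number of nonzero ports to count as "many".
    spread: assumed ratio; the smallest nonzero port score must be at least
            spread * the largest to count as "roughly equal".
    """
    src = nonzero_scores(component["src.ip"])
    dst = nonzero_scores(component["dst.ip"])
    ports = nonzero_scores(component["dst.port"])
    # Exactly one source and one destination IP take part in the behavior.
    if len(src) != 1 or len(dst) != 1:
        return False
    # Many destination ports are involved.
    if len(ports) < min_ports:
        return False
    # The port scores are roughly equal.
    return min(ports) >= spread * max(ports)
```

A component with a single nonzero source, a single nonzero destination, and a thousand evenly scored ports passes this check, while one with two active source IPs does not.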

### Textual report

![Report Sample](report_sample.png "Report Sample")

The report is another way to understand the results of the ENSIGN Cyber workflow. 
* The heading contains metadata about the decomposition including the number of entries (nonzeros) from the input logs and the sizes of the modes (number of unique labels). The rank indicates how many components are in the decomposition.
* As in the visualization, the report contains a subsection for each component, with a further subsection for each mode.
* The top labels of each mode are shown as in the visualization, alongside other features such as "burst" behavior.