![DLI Logo](../images/DLI_Header.png)

# Sensitive Information Detection with Morpheus

In this notebook, you will begin working with the NLP pipeline to perform sensitive information detection (SID) on packet capture data.

## Objectives

By the time you complete this notebook you will:

- Be familiar with the key features specific to the Morpheus NLP pipeline.
- Be able to perform sensitive information detection using the Morpheus NLP pipeline.

---

## The Morpheus NLP Pipeline

Morpheus offers `pipeline-nlp` and ships with an already-trained BERT-based model that can identify several kinds of sensitive data. We will now turn our attention to utilizing `pipeline-nlp` to perform sensitive information detection (SID) on a packet capture dataset.

---

## Data Overview

The source data for our pipeline will be `data/pcap_dump.jsonlines`:

In [1]:
!ls -lh data/pcap_dump.jsonlines

-rw-rw-rw- 1 root root 45M Mar 19  2022 data/pcap_dump.jsonlines


`pcap_dump.jsonlines` contains 93085 packet captures, each represented as a JSON object.

In [2]:
!cat data/pcap_dump.jsonlines | wc -l # Count number of lines / packet captures.

93085


We will use the [`jq`](https://stedolan.github.io/jq/) command-line tool to read both the input JSON data and, later, the output data. We could just as easily use a dataframe, but the `jq` output is more legible.

Run the following cell to look at 2 arbitrary packet captures from the input data. Pay special attention to the `data` fields, which might include sensitive information that we would like to know is being sent through the network.

In [3]:
# Look at 2 arbitrary packet captures at indices `1`, `31`.
!cat data/pcap_dump.jsonlines | jq -s '.[1,31]' | tr -d '\\' # Remove backslashes for easier reading.

{
  "timestamp": 1616380971991,
  "host_ip": "10.188.40.56",
  "data_len": "139",
  "data": ""{"markerEmail": "FuRLFaAZ identify benefit BneiMvCZ join 92694759"}"",
  "src_mac": "04:3f:72:bf:af:74",
  "dest_mac": "b4:a9:fc:3c:46:f8",
  "protocol": "6",
  "src_ip": "10.244.0.1",
  "dest_ip": "10.244.0.25",
  "src_port": "50410",
  "dest_port": "80",
  "flags": "24",
  "is_pii": false
}
{
  "timestamp": 1616380972831,
  "host_ip": "10.188.40.56",
  "data_len": "310",
  "data": "POST /simpledatagen/ HTTP/1.1rnHost: echo.gtc1.netqdev.cumulusnetworks.comrnUser-Agent: python-requests/2.22.0rnAccept-Encoding: gzip, deflaternAccept: */*rnConnection: keep-alivernContent-Length: 434rnContent-Type: application/jsonrnrn",
  "src_mac": "04:3f:72:bf:af:74",
  "dest_mac": "b4:a9:fc:3c:46:f8",
  "protocol": "6",
  "src_ip": "10.244.0.60",
  "dest_ip": "10.20.16.248",
  "src_port": "50436",
  "dest_port": "80",
  "flags": "24",
  "is_pii": false
}
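
For readers who prefer Python to `jq`, the sketch below parses JSON Lines text with the standard library `json` module. The two-record `SAMPLE` string is a shortened stand-in for `data/pcap_dump.jsonlines` (only a few fields are kept), and the helper name `read_jsonlines` is hypothetical.

```python
import json

# Shortened stand-in for two lines of data/pcap_dump.jsonlines (only a few fields kept).
SAMPLE = "\n".join([
    json.dumps({"timestamp": 1616380971991, "data_len": "139", "dest_port": "80", "is_pii": False}),
    json.dumps({"timestamp": 1616380972831, "data_len": "310", "dest_port": "80", "is_pii": False}),
])


def read_jsonlines(text):
    """Parse JSON Lines text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = read_jsonlines(SAMPLE)
print(len(records))            # number of packet captures parsed
print(records[1]["data_len"])  # field access on the second record
```

To read the real file, pass the contents of `data/pcap_dump.jsonlines` to the same function. Loading the whole file into a dataframe is another option, at the cost of less readable output for nested fields.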


## Pipeline Overview

To perform SID on this data, we will use the following `pipeline-nlp` invocation:

```sh
morpheus run \
  pipeline-nlp \
    --labels_file=data/labels_nlp.txt \
  from-file --filename=data/pcap_dump.jsonlines \
  deserialize \
  preprocess \
    --vocab_hash_file=data/bert-base-uncased-hash.txt \
    --truncation=True \
    --do_lower_case=True \
  inf-triton \
    --model_name=sid-minibert-onnx \
    --server_url=triton:8001 \
  add-class \
  serialize \
  to-file --filename=data/output/output.jsonlines --overwrite
```

Much of what you see in the pipeline will be familiar from your earlier work, but it differs from the FIL pipeline in several ways, which we will now go through.

---

## Labels File

The NLP pipeline expects a labels file listing the classes that the NLP model in use has been trained to identify:

In [4]:
!morpheus run pipeline-nlp --help | grep 'labels_file' -A 4

  --labels_file FILE              Specifies a file to read labels from in
                                  order to convert class IDs into labels. A
                                  label file is a simple text file where each
                                  line corresponds to a label  [default:
                                  data/labels_nlp.txt]


For our pipeline we have stored the labels file in its expected default location `data/labels_nlp.txt`, and it contains the kinds of sensitive information our model has been trained to detect:

In [5]:
!cat data/labels_nlp.txt

address
bank_acct
credit_card
email
govt_id
name
password
phone_num
secret_keys
user
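
Because each line of the labels file corresponds to one label, a class ID can be read as a position in the file. Here is a minimal sketch of that lookup. It assumes class IDs are zero-based line indices (an assumption; the help text only says that each line corresponds to a label) and uses an inline copy of the file contents shown above. The helper names are hypothetical.

```python
# Inline copy of data/labels_nlp.txt as printed above.
LABELS_TEXT = """address
bank_acct
credit_card
email
govt_id
name
password
phone_num
secret_keys
user
"""


def load_labels(text):
    """Return the list of labels, one per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def ids_to_labels(class_ids, labels):
    """Translate class IDs to label names, assuming ID == zero-based line index."""
    return [labels[i] for i in class_ids]


labels = load_labels(LABELS_TEXT)
print(ids_to_labels([3, 6], labels))  # labels on the 4th and 7th lines of the file
```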


---

## NLP Preprocessing

Because each kind of Morpheus pipeline performs inference with a different kind of model, the actions performed by the `preprocess` stage differ from pipeline to pipeline. In the case of `pipeline-nlp`, several options are available that are specific to the NLP model we will perform inference with:

In [6]:
!morpheus run pipeline-nlp preprocess --help

Configuring Pipeline via CLI
Usage: morpheus run pipeline-nlp preprocess [OPTIONS]

Options:
  --vocab_hash_file FILE        Path to hash file containing vocabulary of
                                words with token-ids. This can be created from
                                the raw vocabulary using the
                                cudf.utils.hash_vocab_utils.hash_vocab
                                function. Default value is 'data/bert-base-
                                cased-hash.txt'  [default: data/bert-base-
                                cased-hash.txt]
  --truncation BOOLEAN          When set to True, any tokens extending past
                                the max sequence length will be truncated.
                                [default: False]
  --do_lower_case BOOLEAN       Converts all strings to lowercase.  [default:
                                False]
  --add_special_tokens BOOLEAN  Adds special tokens '[CLS]' to the beginning
                   

For our pipeline we will provide a vocabulary hash file, which maps words to token IDs, truncate any tokens extending past the maximum sequence length, and convert all incoming strings to lowercase:

```sh
preprocess \
  --vocab_hash_file=data/bert-base-uncased-hash.txt \
  --truncation=True \
  --do_lower_case=True
```
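
To make the two boolean options concrete, here is a toy illustration of lowercasing and truncation. It is not the Morpheus implementation: it splits on whitespace instead of using the BERT subword vocabulary, `MAX_SEQ_LEN = 8` is a hypothetical limit chosen for the example, and the sample string is made up.

```python
MAX_SEQ_LEN = 8  # hypothetical limit, for illustration only


def toy_preprocess(text, do_lower_case=True, truncation=True, max_len=MAX_SEQ_LEN):
    """Lowercase the text, split it on whitespace, and cut it to max_len tokens."""
    if do_lower_case:
        text = text.lower()
    tokens = text.split()  # stand-in for real subword tokenization
    if truncation:
        tokens = tokens[:max_len]  # drop everything past the length limit
    return tokens


# Made-up request text with 11 whitespace-separated tokens.
sample = (
    "POST /simpledatagen/ HTTP/1.1 Host: echo.example "
    "User-Agent: python-requests/2.22.0 Accept: */* Connection: keep-alive"
)
tokens = toy_preprocess(sample)
print(len(tokens), tokens[0])
```

With truncation disabled, the same call returns all 11 tokens, which is why a model with a fixed input length needs either truncation or inputs that are already short enough.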

---

## SID MiniBERT Model

For inference we will be utilizing a model trained on top of MiniBERT to perform detection of data matching the labels we viewed above. This model, which ships along with Morpheus, is called `sid-minibert-onnx`, and has already been loaded into Triton for use.

```sh
inf-triton \
  --model_name=sid-minibert-onnx \
  --server_url=triton:8001
```

In [7]:
!curl -s -X POST triton:8000/v2/repository/index | jq

[
  {
    "name": "abp-nvsmi-xgb",
    "version": "1",
    "state": "READY"
  },
  {
    "name": "phishing-bert-onnx",
    "version": "1",
    "state": "READY"
  },
  {
    "name": "sid-minibert-onnx",
    "version": "1",
    "state": "READY"
  }
]

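The same readiness check can be scripted. The sketch below works on an inline copy of the repository index response shown above rather than contacting the server, and the helper name `is_model_ready` is hypothetical.

```python
import json

# Inline copy of the repository index response shown above.
INDEX_JSON = """[
  {"name": "abp-nvsmi-xgb", "version": "1", "state": "READY"},
  {"name": "phishing-bert-onnx", "version": "1", "state": "READY"},
  {"name": "sid-minibert-onnx", "version": "1", "state": "READY"}
]"""


def is_model_ready(index, model_name):
    """Return True if the index lists model_name with state READY."""
    return any(
        entry.get("name") == model_name and entry.get("state") == "READY"
        for entry in index
    )


index = json.loads(INDEX_JSON)
print(is_model_ready(index, "sid-minibert-onnx"))
```

A check like this is useful before launching a long pipeline run, since a missing or unloaded model would otherwise only surface once inference starts.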

---

## Run the SID Pipeline

With all those details in place, we are ready to execute the following SID pipeline.

Compared to the FIL pipeline we have run in previous sections, this pipeline processes a lot more data and will take a significantly longer time to complete. With that in mind, we will run the pipeline in a Jupyter terminal, so we can investigate its output here in the notebook while it continues to run.

- Open a Jupyter terminal
- `cd` into `/dli/task/07-SID`
- Run `./launch_sid.sh`, which contains the pipeline we have been describing along with a couple of `monitor` stages that report throughput for the preprocessing and inference stages

You should see `Starting pipeline via CLI...` and shortly thereafter info about the preprocessing and inference rates for the pipeline:

```
Preprocessing rate: 93085messages [01:02, 1488.33messages/s]
Inference rate: 66560inf [01:02, 1070.50inf/s]
```

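If you prefer, you can instead launch `launch_sid.sh` from the notebook by executing the cell below. Keep in mind that the cell blocks until the pipeline finishes, so the remaining cells cannot run in the meantime.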

In [10]:
!./launch_sid.sh

Configuring Pipeline via CLI
Starting pipeline via CLI... Ctrl+C to Quit
Preprocessing rate: 0messages [00:00, ?messages/s]
Inference rate: 0inf [00:00, ?inf/s]
Preprocessing rate: 256messages [00:00, 544.14messages/s]
Preprocessing rate: 33280messages [00:02, 544.14messages/s]
Preprocessing rate: 33280messages [00:03, 544.14messages/s]
Preprocessing rate: 33280messages [00:04, 544.14messages/s]
Preprocessing rate: 33280messages [00:05, 544.14messages/s]
Preprocessing rate: 33280messages [00:06, 544.14messages/s]
Preprocessing rate: 33280messages [00:07, 544.14messages/s]
Preprocessing rate: 33280messages [00:08, 544.14messages/s]
Preprocessing rate: 33280messages [00:09, 544.14messages/s]
Preprocessing rate: 33280messages [00:10, 544.14messages/s]
Preprocessing rate: 33280messages [00:11, 544.14messages/s]
Preprocessing rate: 33280messages [00:12, 544.14messages/s]
Preprocessing rate: 33280messages [00:13, 544.14messages/s]
Preprocessing rate: 33280messages [00:14

---

## Results Overview

Now that the pipeline is running, we can confirm that it is writing its results to `data/output/output.jsonlines` as we specified in the pipeline.

In [11]:
!ls data/output

output.jsonlines


If you don't see `output.jsonlines` in the output above, wait a few seconds for the pipeline to spin up, and run the cell again until you can confirm it is present.

Running the cell below will show that the output file is being continuously written to by the active pipeline.

In [12]:
# Print the line count of the output file 5 times, 2 seconds apart (about 10 seconds in total).
!for i in {1..5}; do cat data/output/output.jsonlines | wc -l; sleep 2; done

0
0
0
0
0
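
If you prefer to poll from Python instead of the shell, a sketch along the following lines counts the lines in a file at a fixed interval. The helper names are hypothetical, and the demonstration writes to a temporary file instead of `data/output/output.jsonlines` so that it runs anywhere.

```python
import os
import tempfile
import time


def count_lines(path):
    """Return the number of lines in path, or 0 if the file does not exist yet."""
    if not os.path.exists(path):
        return 0
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def poll_line_count(path, times=5, interval=2.0):
    """Sample the line count `times` times, sleeping `interval` seconds between samples."""
    counts = []
    for i in range(times):
        counts.append(count_lines(path))
        if i < times - 1:
            time.sleep(interval)
    return counts


# Demonstration on a temporary file rather than the real pipeline output.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "output.jsonlines")
    missing = count_lines(demo_path)  # file not created yet
    with open(demo_path, "w") as f:
        f.write('{"a": 1}\n{"a": 2}\n')
    demo_counts = poll_line_count(demo_path, times=2, interval=0.01)

print(missing, demo_counts)
```

Against the real output file, a rising sequence of counts indicates that the pipeline is still writing results, while a run of identical values means either that it has finished or that it has not started writing yet.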


Examining the output, we can see that the pipeline has labeled packets with the classes of sensitive information listed in the labels file. For example, note the `"user"` field for the following entry (if the cell prints `null`, the output file does not yet hold a record at index `100`, so wait for more results and run the cell again):

In [13]:
!cat data/output/output.jsonlines | jq -s '.[100]' | tr -d '\\' # View a packet capture annotated with SID labels.

null


---

## Sample Outputs

Here is a small collection of sample outputs highlighting some examples of our pipeline's results.

### Secret Keys

In [14]:
# Show the first 3 data fields for packets identified as having secret keys.
!cat data/output/output.jsonlines | \
  jq -s 'map(select(.secret_keys == 1) | {data: .data, secret_key: .secret_keys})[0:3]' | \
  `# Remove backslashes for easier reading.` \
  tr -d '\\'

[]


### Email and Password

In [15]:
# Show the first 3 data fields for packets identified as having both email addresses and passwords.
!cat data/output/output.jsonlines | \
  jq -s 'map(select(.email == 1 and .password == 1) | {data: .data, email: .email, password: .password})[0:3]' | \
  `# Remove backslashes for easier reading.` \
  tr -d '\\'

[]
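
The `jq` filters above can also be expressed in Python. This sketch runs on a few made-up records (not pipeline output) and assumes, as the filters do, that a detected label is stored as `1`. The helper name `select_flagged` is hypothetical.

```python
def select_flagged(records, labels, limit=3):
    """Return up to `limit` records in which every label in `labels` equals 1.

    Each result keeps only the `data` field and the requested label fields,
    mirroring the object built by the jq filters.
    """
    hits = [
        {"data": r.get("data"), **{label: r[label] for label in labels}}
        for r in records
        if all(r.get(label) == 1 for label in labels)
    ]
    return hits[:limit]


# Made-up records for illustration only.
records = [
    {"data": "first", "email": 1, "password": 1, "secret_keys": 0},
    {"data": "second", "email": 1, "password": 0, "secret_keys": 0},
    {"data": "third", "email": 0, "password": 0, "secret_keys": 1},
]

both = select_flagged(records, ["email", "password"])
print(both)
```

One difference from `jq` is worth keeping in mind: in Python `True == 1` holds, so this function also matches boolean flags, whereas the `jq` comparison `== 1` matches only the number `1`.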


---

## Next

In the next section you will apply what you've learned about the NLP pipeline to construct one of your own.

Please continue to the next notebook.