![DLI Logo](../images/DLI_Header.png)

# Exercise: Build a SID Pipeline

Construct an NLP pipeline for `data/pcap_dump.jsonlines`, which has been reduced in size from the previous section so that the pipeline completes more quickly.

In addition to correctly ingesting, preprocessing, inferencing, and outputting data, your pipeline should also do the following:

- Utilize the class labels file `data/labels.txt`
- Keep all messages with a probability of containing sensitive information greater than 0.2 using the `filter` stage
- Include only the `timestamp`, `host_ip`, and `data` fields in the output using the `serialize` stage
- Save the output as `data/output/output.jsonlines`
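The filter and field-selection requirements above can be pictured with a small plain-Python sketch. It is only an analogy, not Morpheus code: the sample messages, the `si_probability` key, and the constant names are made up for illustration, and the real stages work on the pipeline's own message objects.

```python
# Hypothetical messages; `si_probability` is a made-up stand-in for the
# model's sensitive-information score, not a real Morpheus field name.
messages = [
    {"timestamp": "2021-03-22 03:57:11.671", "host_ip": "10.188.40.56",
     "data": "a", "data_len": 1, "si_probability": 0.05},
    {"timestamp": "2021-03-22 03:57:11.876", "host_ip": "10.188.40.56",
     "data": "b", "data_len": 1, "si_probability": 0.2},
    {"timestamp": "2021-03-22 03:57:11.980", "host_ip": "10.188.40.56",
     "data": "c", "data_len": 1, "si_probability": 0.35},
    {"timestamp": "2021-03-22 03:57:12.291", "host_ip": "10.188.40.56",
     "data": "d", "data_len": 1, "si_probability": 0.9},
]

THRESHOLD = 0.2
OUTPUT_FIELDS = ("timestamp", "host_ip", "data")

# Analogue of the filter requirement: keep scores strictly greater than 0.2.
kept = [m for m in messages if m["si_probability"] > THRESHOLD]

# Analogue of the serialize requirement: keep only the requested fields.
output = [{field: m[field] for field in OUTPUT_FIELDS} for m in kept]

for row in output:
    print(row)
```

Only the last two sample messages survive here, and each output row carries just the three requested fields.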

Because the incoming data has been greatly reduced in size, your pipeline should complete within a few seconds, and you should be able to run it from the notebook without issue. If the pipeline takes much longer than that, leftover processes from previous notebooks may be consuming GPU memory. To fix this, use the Jupyter menu to select _Kernel_ -> _Shut Down All Kernels..._ and then try again.

---

## Your Work Here

In [1]:
!morpheus run \
  pipeline-nlp \
    --labels_file=data/labels.txt \
  from-file --filename=data/pcap_dump.jsonlines \
  deserialize \
  preprocess \
    --vocab_hash_file=data/bert-base-uncased-hash.txt \
    --truncation=True \
    --do_lower_case=True \
  inf-triton \
    --model_name=sid-minibert-onnx \
    --server_url=triton:8001 \
  add-class \
  filter --threshold=0.2 \
  serialize \
    --include='timestamp' \
    --include='host_ip' \
    --include='data' \
    --exclude='data_len' \
  to-file --filename=data/output/output.jsonlines --overwrite

Configuring Pipeline via CLI
Starting pipeline via CLI... Ctrl+C to Quit


---

## Explore Your Results

To assist your work, you can use the following cells to load your output file and examine its contents.

In [2]:
import pandas as pd

In [3]:
results = pd.read_json('data/output/output.jsonlines', lines=True)

In [4]:
results.shape

(17, 3)

In [5]:
results

| | timestamp | host_ip | data |
|---|---|---|---|
| 0 | 2021-03-22 03:57:11.671 | 10.188.40.56 | `"{\"name\": \"kxpmO8AR 04893905\", \"lang\": \...` |
| 1 | 2021-03-22 03:57:11.876 | 10.188.40.56 | `"{\"name\": \"impact 58JjryDh kind 16742730 el...` |
| 2 | 2021-03-22 03:57:11.980 | 10.188.40.56 | `"{\"query\": \"tgaLEqlT leader plant 43791449\...` |
| 3 | 2021-03-22 03:57:12.291 | 10.188.40.56 | `"{\"id\": -432436979, \"userFields\": \"Patric...` |
| 4 | 2021-03-22 03:57:12.393 | 10.188.40.56 | `"{\"id\": 290110817, \"userFields\": \"Stefani...` |
| 5 | 2021-03-22 03:57:12.397 | 10.188.40.56 | `"{\"id\": 966611706, \"userFields\": \"Adam Fl...` |
| 6 | 2021-03-22 03:57:12.602 | 10.188.40.56 | `"{\"id\": 156558966, \"userFields\": \"Nathani...` |
| 7 | 2021-03-22 03:57:12.703 | 10.188.40.56 | `"{\"id\": -674262537, \"userFields\": \"Rhonda...` |
| 8 | 2021-03-22 03:57:12.707 | 10.188.40.56 | `"{\"id\": -386527694, \"userFields\": \"John R...` |
| 9 | 2021-03-22 03:57:12.810 | 10.188.40.56 | `"{\"id\": -477637789, \"userFields\": \"Elizab...` |


---

## Check Your Work

To check your work, run the following cell. It runs the shell command `diff` to compare your output (`data/output/output.jsonlines`) with an existing solution output (`data/output/solution.jsonlines`). If the files are identical, there will be no output and you will know your work is correct.

In [6]:
!diff data/output/output.jsonlines data/output/solution.jsonlines
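If `diff` does print output, the same comparison can be made from Python to inspect the mismatch line by line. The sketch below uses the standard library's `difflib`; the helper name `diff_lines` is made up for this illustration, and the two sample files are stand-ins written to a temporary directory so the example runs on its own.

```python
import difflib
import tempfile
from pathlib import Path


def diff_lines(path_a, path_b):
    """Return unified-diff lines for two text files (empty list if identical)."""
    a = Path(path_a).read_text().splitlines()
    b = Path(path_b).read_text().splitlines()
    return list(
        difflib.unified_diff(a, b, fromfile=str(path_a), tofile=str(path_b), lineterm="")
    )


# Made-up stand-ins for the output and solution files.
with tempfile.TemporaryDirectory() as tmp:
    mine = Path(tmp) / "output.jsonlines"
    solution = Path(tmp) / "solution.jsonlines"
    mine.write_text('{"host_ip": "10.0.0.1"}\n{"host_ip": "10.0.0.2"}\n')
    solution.write_text('{"host_ip": "10.0.0.1"}\n{"host_ip": "10.0.0.3"}\n')

    same = diff_lines(mine, mine)         # identical files: no diff lines
    changed = diff_lines(mine, solution)  # second line differs

print(len(same))
for line in changed:
    print(line)
```

To apply it to the exercise files, call `diff_lines('data/output/output.jsonlines', 'data/output/solution.jsonlines')`; an empty list corresponds to `diff` printing nothing.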

---

## Solution

If you get stuck, click on the `...` directly below to expand the solution pipeline.

```sh
morpheus run \
  pipeline-nlp \
    --labels_file=data/labels.txt \
  from-file --filename=data/pcap_dump.jsonlines \
  deserialize \
  preprocess \
    --vocab_hash_file=data/bert-base-uncased-hash.txt \
    --truncation=True \
    --do_lower_case=True \
  inf-triton \
    --model_name=sid-minibert-onnx \
    --server_url=triton:8001 \
  add-class \
  filter --threshold=0.2 \
  serialize \
    --include='timestamp' \
    --include='host_ip' \
    --include='data' \
    --exclude='data_len' \
  to-file --filename=data/output/output.jsonlines --overwrite
```
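One detail in the solution deserves a note: `--exclude='data_len'` appears even though `data_len` is never listed in `--include`. A plausible reason, assuming the `serialize` stage treats each `--include` and `--exclude` value as a regular expression matched from the start of the column name, is that the pattern `data` also matches `data_len`. The sketch below reproduces that behavior with Python's `re` module. The `select_columns` helper and the column list are hypothetical, and this is not the Morpheus implementation.

```python
import re


def select_columns(columns, include, exclude):
    """Keep columns matching any include pattern and no exclude pattern.

    Patterns are matched from the start of the name with re.match, which is
    an assumption about how the stage behaves, not a confirmed detail.
    """
    include_re = [re.compile(p) for p in include]
    exclude_re = [re.compile(p) for p in exclude]
    kept = [c for c in columns if any(r.match(c) for r in include_re)]
    return [c for c in kept if not any(r.match(c) for r in exclude_re)]


# Hypothetical input columns for illustration.
columns = ["timestamp", "host_ip", "data", "data_len", "src_port"]
include = ["timestamp", "host_ip", "data"]

without_exclude = select_columns(columns, include, exclude=[])
with_exclude = select_columns(columns, include, exclude=["data_len"])

print(without_exclude)  # ['timestamp', 'host_ip', 'data', 'data_len']
print(with_exclude)     # ['timestamp', 'host_ip', 'data']
```

Under that assumption, dropping the `--exclude` flag would let `data_len` slip into the output, and the `diff` check against the solution file would fail.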

---

## Next

In the next section, you will take a quick break from programming to look at and discuss how Morpheus is already being used in industry.

Please continue to the next notebook.