![DLI Logo](../images/DLI_Header.png)

# An Introduction to NLP

In this notebook, you will get a high-level introduction to key Natural Language Processing (NLP) concepts in preparation for using NLP in Morpheus pipelines.

## Objectives

By the time you complete this notebook you will:

- Understand how many common cybersecurity tasks can be addressed with natural language processing.
- Understand enough about NLP to begin using it in Morpheus pipelines.
- Know where to go to learn more about NLP should you wish.

---

## Cybersecurity Tasks as NLP Problems

Many cybersecurity tasks that deal with textual information, such as log parsing and analysis or phishing detection, can be treated as natural language processing (**NLP**) problems. Modern NLP relies on **deep learning**: **models** are **trained** to recognize characteristics of text in ways that emulate or even surpass human capabilities.

In cybersecurity, traditional methods for textual analysis rely on rules-based systems, often leaning heavily on regular expressions (regex). Regex rules can be brittle and require constant maintenance to handle novel kinds of text. NLP models, however, can learn generalized characteristics of text and make accurate assessments of never-before-seen inputs.
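
To make the brittleness concrete, here is a minimal sketch of a single regex rule meeting small variations in formatting. The rule `PHONE_RULE` and the sample strings are hypothetical and chosen only for illustration; they are not taken from Morpheus or from any real rule set.

```python
import re

# Hypothetical rule: match a phone number written exactly as 555-123-4567.
PHONE_RULE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

# Made-up inputs that a human would read as the same phone number.
samples = [
    "call me at 555-123-4567",    # the format the rule was written for
    "call me at (555) 123-4567",  # parentheses and a space
    "call me at 555.123.4567",    # dots instead of dashes
]

# True where the rule fires, False where it silently misses.
matches = [bool(PHONE_RULE.search(s)) for s in samples]

for text, hit in zip(samples, matches):
    print(f"{hit!s:5}  {text}")
```

Only the first sample matches. Each new format would need another rule or a more complicated pattern, which is the maintenance burden described above.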

---

## Sequence Classification

More specifically, we would like to use NLP to perform [sequence classification](https://docs.nvidia.com/tao/tao-toolkit/text/nlp/text_classification.html) on cybersecurity logs. In sequence classification, we define a set of categories and train a model to decide which of them apply to a given piece of unstructured text.

---

## A Non-Cybersecurity Example

As an example, let's say we would like to identify whether the following categories of information appear in a written paragraph:
- name
- location
- time

Given the following paragraph...

> Both our children celebrated birthdays last week: Rowan on Wednesday and Cora on Sunday. For Ro's birthday we went to the beach with friends, for Cora's we hung out in the backyard with their grandparents.

...sequence classification might generate the following:

```
{
    "name": true,
    "location": true,
    "time": true
}
```

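...whereas named entity recognition (NER), a related task that labels individual spans of text rather than the sequence as a whole, might generate the following: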

> Both our children celebrated birthdays **last week _(time)_**: **Rowan _(name)_** on **Wednesday _(time)_** and **Cora _(name)_** on **Sunday _(time)_**. For **Ro's _(name)_** birthday we went to **the beach _(location)_** with friends, for **Cora's _(name)_** we hung out in **the backyard _(location)_** with their grandparents.
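
The two kinds of output can also be compared as plain data. The sketch below hand-writes the entity annotations from the paragraph above as `(span, label)` pairs and collapses them into one true/false flag per category. The names `entities`, `CATEGORIES`, and `sequence_labels` are hypothetical, and no model is involved; it only illustrates how the sequence-level result summarizes what entity recognition reports span by span.

```python
# Hand-written (span, label) pairs mirroring the annotated paragraph above.
entities = [
    ("last week", "time"),
    ("Rowan", "name"),
    ("Wednesday", "time"),
    ("Cora", "name"),
    ("Sunday", "time"),
    ("Ro's", "name"),
    ("the beach", "location"),
    ("Cora's", "name"),
    ("the backyard", "location"),
]

# The categories we asked about in this example.
CATEGORIES = ["name", "location", "time"]

# A category is "present" in the sequence if any span carries that label.
found = {label for _, label in entities}
sequence_labels = {category: category in found for category in CATEGORIES}

print(sequence_labels)
```

Sequence classification answers "is this category present anywhere in the text?", while named entity recognition also says where.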

---

## Sequence Classification for Cybersecurity

In this workshop, we will use the NLP tools that ship with Morpheus to perform sequence classification on packet captures, identifying the following classes of sensitive information:

- address
- bank_acct
- credit_card
- email
- govt_id
- name
- password
- phone_num
- secret_keys
- user
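
One way to picture the result for these ten classes is sketched below. It assumes, which this notebook does not state, that the model emits one independent raw score (logit) per class and that a class is reported when the sigmoid of its score reaches a threshold. The function `classify`, the 0.5 threshold, and the example scores are all hypothetical; this is not the Morpheus API.

```python
import math

# The ten classes listed above, in the same order.
LABELS = [
    "address", "bank_acct", "credit_card", "email", "govt_id",
    "name", "password", "phone_num", "secret_keys", "user",
]


def sigmoid(x: float) -> float:
    """Map a raw score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def classify(logits: list[float], threshold: float = 0.5) -> dict[str, bool]:
    """Hypothetical helper: flag each class whose score reaches the threshold."""
    if len(logits) != len(LABELS):
        raise ValueError("expected one score per label")
    return {label: sigmoid(z) >= threshold for label, z in zip(LABELS, logits)}


# Made-up scores for a single packet payload (assumption, not real model output).
example_logits = [-4.0, -3.2, -5.1, 2.7, -4.4, 1.3, -2.9, -3.8, -4.6, 0.4]

result = classify(example_logits)
detected = [label for label, hit in result.items() if hit]
print(detected)
```

Because each class is scored independently under this assumption, a single input can be flagged for several classes at once, just as the paragraph in the earlier example was flagged for name, location, and time together.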

---

## Additional Resources

If you would like to take a deeper dive into NLP, please consider the following resources:

- *[CyBERT](https://medium.com/rapids-ai/cybert-28b35a4c81c4)*: This blog post covers how NVIDIA built on top of the state-of-the-art BERT NLP model to develop the pretrained NLP model used in Morpheus for cybersecurity-relevant named-entity recognition.
- *[Preprocess Your Data at Lightspeed with Our GPU-based Tokenizer for BERT Language Models](https://medium.com/rapids-ai/preprocess-your-training-data-at-lightspeed-with-our-gpu-based-tokenizer-for-bert-language-models-561cf9c46e15)*: This technical blog post discusses how NVIDIA has pushed the limits of CyBERT performance on GPUs.
- *[Building Transformer-Based Natural Language Processing Applications](https://courses.nvidia.com/courses/course-v1:DLI+C-FX-03+V3/about)*: This Deep Learning Institute workshop gives extensive, interactive coverage of building state-of-the-art NLP applications.

## Next

With this very high-level understanding of sequence classification, you now have enough context to begin using NLP in Morpheus pipelines.

Please continue to the next notebook.