# Natural Language Processing

Natural Language Processing (NLP) is the field of Artificial Intelligence concerned with the processing, understanding, and generation of human language. It is used extensively in search engines, conversational interfaces, document processors, and more. Machines handle structured data well, and can even cope with less structured data like images, but they have a hard time with free-form text, because language is very versatile: the same idea can be phrased in many ways, and the same words can mean different things in different contexts.

The goal of NLP is to develop algorithms that enable computers to read free-form text, understand the information embedded in it, and even generate an answer when the text poses a question.

# High-level overview of NLP

Let's start with a general, high-level, but fairly complete overview of the field of NLP.

<a href="https://www.youtube.com/watch?v=ZifynN2oyhs&ab_channel=AssemblyAI"><img src="resources/NLP_overview.png" width="800"></a>


In the next video (from 0:00 to 7:47), we'll dive a little deeper into the details of NLP.


<a href="https://www.youtube.com/embed/oi0JXuL19TA?si=-8QG1G4aqUynjxfw?end=747"><img src="resources/NLP_video.png" width="800"></a>


# Natural Language Understanding (NLU)

NLU involves encoding free-form text in a way that allows us to process and model it effectively. By transforming text into knowledge representations, we create vectors that capture essential information. For instance, when processing an email, an NLU system generates a vector that captures its crucial details, which makes it possible to classify the email (e.g., as HR, finance, or legal).

**Encoder models** convert free-form text into vector representations that are relevant to a specific domain or task.
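The email example above can be sketched in a few lines. This is a deliberately simplified illustration, not how a trained encoder model works: the vocabulary `VOCAB`, the category keyword lists in `PROTOTYPES`, and the function names are all hypothetical, and a plain word-count vector stands in for a learned representation.

```python
import re
from collections import Counter

# Hypothetical fixed vocabulary; a real encoder learns its features from data.
VOCAB = ["salary", "vacation", "invoice", "payment", "contract", "lawsuit"]


def encode(text):
    """Encode text as a fixed-length vector of word counts over VOCAB."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[word] for word in VOCAB]


# Hypothetical prototype vectors, one per category.
PROTOTYPES = {
    "HR": encode("salary vacation"),
    "finance": encode("invoice payment"),
    "legal": encode("contract lawsuit"),
}


def classify(text):
    """Return the category whose prototype has the largest dot product."""
    vector = encode(text)

    def score(label):
        return sum(a * b for a, b in zip(vector, PROTOTYPES[label]))

    return max(PROTOTYPES, key=score)


email = "Please send the invoice so we can process the payment."
print(encode(email))
print(classify(email))
```

The key idea is only that free-form text becomes a fixed-length vector that downstream code can compare and classify; real systems replace the count vector with a learned encoding.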

# Natural Language Generation (NLG)

NLG complements NLU by creating human-like text from structured data or other non-linguistic inputs. NLG systems transform this information into coherent sentences, paragraphs, or longer texts. **Decoder models** play a key role in NLG, converting encoded vectors into expressive, contextually appropriate language.
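To make "text from structured data" concrete, here is a minimal sketch that uses a fixed template rather than a decoder model. Template filling is a much simpler technique than the neural decoders mentioned above; the `describe_order` function and the fields of the `order` record are hypothetical and chosen only for illustration.

```python
def describe_order(order):
    """Turn a structured order record into one English sentence (template-based)."""
    items = order["items"]
    if len(items) == 1:
        item_text = items[0]
    else:
        # Join all but the last item with commas, then add "and <last item>".
        item_text = ", ".join(items[:-1]) + " and " + items[-1]
    return (
        f"{order['customer']} ordered {item_text} "
        f"for a total of {order['total']:.2f} {order['currency']}."
    )


# Hypothetical structured input.
order = {
    "customer": "Alice",
    "items": ["a laptop", "a mouse"],
    "total": 1249.5,
    "currency": "EUR",
}
print(describe_order(order))
```

A template like this always produces the same sentence shape; decoder models aim to generate more varied and context-sensitive text from their input.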

## Word Meaning:

Words themselves lack inherent meaning. Their significance arises from context and usage.
AI systems must learn the meaning of words, disambiguate homonyms, and handle noisy input such as typos and slang.

## Representing and Comparing Words:

**Count Vectors**: Each word is represented by counts of the words it co-occurs with in sentences, which gives a simple way to compare words. However, these vectors grow with the size of the vocabulary and are mostly zeros, so they demand significant storage space.

**Word Embeddings**: Unsupervised learning techniques create dense word representations that reveal semantic relationships. For instance, similar words cluster together.

**Visualizing Word Clusters**: By projecting word embeddings from their high-dimensional space down to two or three dimensions, we can plot them and see how words relate to each other.

**Encoder-Decoder Models**: These capture sentence meaning by encoding a sequence of words into a shared representation, from which a decoder can predict, for example, the next word in a sequence. Recurrent Neural Networks (RNNs) are common examples.
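As a rough illustration of count vectors, the sketch below builds a co-occurrence vector for every word in a tiny corpus and compares words with cosine similarity. The corpus and the function names `cooccurrence_vectors` and `cosine` are assumptions made for this example; learned word embeddings are trained on far more text and are not shown here.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences):
    """Map each word to a Counter of the words sharing a sentence with it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            for j, other in enumerate(words):
                if i != j:
                    vectors[word][other] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the car needs fuel",
]
vectors = cooccurrence_vectors(corpus)

# "cat" and "dog" share more context words than "cat" and "car".
print(cosine(vectors["cat"], vectors["dog"]))
print(cosine(vectors["cat"], vectors["car"]))
```

Because every word needs one dimension per vocabulary entry, these vectors become very large on real corpora, which is the storage cost that dense embeddings avoid.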

# Core NLP techniques:

## Tokenization:

Tokenization is the process of converting a sequence of characters into a sequence of tokens. In the context of NLP, tokens typically represent words, phrases, or other meaningful elements of language.
It’s a critical first step in NLP, because it transforms raw text into an organized structure that a machine learning model can understand and analyze. By breaking down text into tokens, we can start to analyze the frequency and distribution of words, which is essential for many NLP tasks.
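A minimal sketch of word-level tokenization and frequency counting, assuming a simple regular expression is good enough for the text at hand. The `tokenize` function and its pattern are illustrative choices; production tokenizers handle many more cases (URLs, hyphenation, subword units, other scripts).

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens, keeping contractions like "don't"."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())


tokens = tokenize("Don't panic: NLP isn't magic.")
print(tokens)

# Token frequencies are a common first analysis step.
frequencies = Counter(tokenize("the cat saw the dog"))
print(frequencies.most_common(1))
```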

## Stemming & Lemmatization:

Stemming and lemmatization are techniques used to reduce words to their root form. Stemming does this by chopping off the ends of words to find the base or stem. For example, “running” and “runs” both stem to “run”; whether “runner” does too depends on the stemmer.
This simplification is particularly useful in search engines and text classification, where the exact form of a word is less important than its general meaning. Stemming reduces the complexity of the text and can improve the performance of NLP models.
Lemmatization also reduces words to a root form, but it uses a dictionary to map each word to its lemma, so irregular forms are handled too: “ran” becomes “run,” and “better” becomes “good.”
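The difference between the two approaches can be sketched as follows. Both functions are toy versions written for this example: `simple_stem` is a crude suffix stripper (it is not the Porter algorithm), and `LEMMAS` is a hypothetical four-entry dictionary standing in for a full lexicon.

```python
def simple_stem(word):
    """Toy stemmer: strip one common suffix, then undo consonant doubling."""
    word = word.lower()
    for suffix in ("ing", "er", "s"):
        # Only strip if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # "runn" -> "run"; doubled vowels, "l" and "s" are left alone.
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word


# Hypothetical mini-dictionary mapping inflected forms to their lemma.
LEMMAS = {"ran": "run", "better": "good", "mice": "mouse", "was": "be"}


def lemmatize(word):
    """Toy lemmatizer: dictionary lookup, falling back to the lowercase word."""
    word = word.lower()
    return LEMMAS.get(word, word)


for word in ["running", "runs", "ran", "better"]:
    print(word, simple_stem(word), lemmatize(word))
```

Note how the stemmer leaves “ran” untouched and cuts “better” down to “bet,” while the dictionary lookup returns “run” and “good.” Stemming is cheap but blind to meaning; lemmatization needs a lexicon but gives real words.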

## Named Entity Recognition (NER):

NER is a method by which an AI system finds and classifies named entities in text into predefined categories such as names of people, organizations, locations, dates, etc.
It’s a key part of information extraction that allows us to quickly understand and organize large amounts of data by highlighting the most important pieces of information.
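As a rough sketch of what a NER system outputs, the example below tags entities using a lookup list and a date pattern. This is a rule-based stand-in, not a trained NER model: the `GAZETTEER` entries, the ISO-style `DATE_PATTERN`, and the `find_entities` function are all assumptions made for illustration, and real systems recognize names they have never seen before.

```python
import re

# Hypothetical gazetteer: a tiny list of known names per category.
GAZETTEER = {
    "PERSON": ["Ada Lovelace", "Alan Turing"],
    "ORG": ["Acme Corp"],
    "LOCATION": ["London", "Paris"],
}

# Assumes dates are written as YYYY-MM-DD.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def find_entities(text):
    """Return (entity text, label) pairs in order of appearance."""
    found = []
    for label, names in GAZETTEER.items():
        for name in names:
            for match in re.finditer(re.escape(name), text):
                found.append((match.start(), match.group(), label))
    for match in DATE_PATTERN.finditer(text):
        found.append((match.start(), match.group(), "DATE"))
    found.sort()  # sort by position in the text
    return [(entity, label) for _, entity, label in found]


sentence = "Alan Turing joined Acme Corp in London on 1950-10-01."
for entity, label in find_entities(sentence):
    print(label, entity)
```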

## Additional Preprocessing Techniques:

**Stop Words Removal**: In NLP, stop words like “the,” “is,” and “and” are often removed from the text. These words are filtered out because they are common and do not carry significant meaning.
Other steps might include **normalizing text** by converting it to lowercase, removing punctuation, correcting spelling errors, and more. These steps are crucial for reducing noise in the data and improving the accuracy of subsequent NLP tasks.
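These steps can be combined into a small preprocessing function. The sketch below lowercases the text, strips ASCII punctuation, and drops stop words; the `STOP_WORDS` set is a short hypothetical list, whereas real stop-word lists are longer and depend on the language and task.

```python
import string

# Hypothetical, deliberately short stop-word list.
STOP_WORDS = {"the", "is", "and", "a", "an", "of", "to", "in"}


def preprocess(text):
    """Lowercase, remove ASCII punctuation, split on whitespace, drop stop words."""
    lowered = text.lower()
    no_punctuation = lowered.translate(str.maketrans("", "", string.punctuation))
    return [token for token in no_punctuation.split() if token not in STOP_WORDS]


print(preprocess("The cat is on the mat, and the dog is happy!"))
```

Whether to remove stop words is a design choice: for tasks such as sentiment analysis, small words like “not” can carry the meaning, so the list should be chosen with the downstream task in mind.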

Together, these preprocessing techniques lay the groundwork for more complex NLP tasks, such as sentiment analysis, machine translation, and question-answering systems. They help in transforming raw text into a more structured form that’s amenable to analysis and insight extraction.