# Understand the Transformer architecture and explore large language models in Azure Machine Learning
Foundation models, such as GPT-3, are state-of-the-art natural language processing models designed to understand, generate, and interact with human language. To understand the significance of foundation models, it's essential to explore their origins, which stem from advancements in the fields of artificial intelligence and natural language processing.

## Understand natural language processing
Natural language processing (NLP) is a type of AI that focuses on understanding, interpreting, and generating human language. Some common NLP use cases are:

1. Speech-to-text and text-to-speech conversion. For example, generate subtitles for videos.
2. Machine translation. For example, translate text from English to Japanese.
3. Text classification. For example, label an email as spam or not spam.
4. Entity extraction. For example, extract keywords or names from a document.
5. Question answering. For example, provide answers to questions like "What is the capital of France?"
6. Text summarization. For example, generate a short one-paragraph summary from a multi-page document.

Historically, NLP has been challenging as our language is complex and computers find it hard to understand text. In this module, you learn how developments in AI and specifically NLP have led to the models we use today. You'll explore and use various language models in the model catalog, available in the Azure Machine Learning studio.


# Understand statistical techniques used for natural language processing (NLP)
Over the last decades, multiple developments in the field of natural language processing (NLP) have led to today's large language models (LLMs).

To understand LLMs, let's first explore the statistical techniques for NLP that over time have contributed to the current techniques.

## The beginnings of NLP
As NLP is focused on understanding and generating text, most first attempts at accomplishing NLP were based on using the rules and structure inherent to languages. Especially before machine learning techniques became prevalent, structural models and formal grammar were the primary methods employed.

These approaches relied on explicit programming of linguistic rules and grammatical patterns to process and generate text. Though these models could handle some specific language tasks reasonably well, they faced significant challenges when confronted with the vast complexity and variability of natural languages.

Instead of hard-coding rules, researchers in the 1990s began to utilize statistical and probabilistic models to learn patterns and representations directly from data.

## Understanding tokenization
As you may expect, machines have a hard time deciphering text as they mostly rely on numbers. To read text, we therefore need to convert the presented text to numbers.

One important development that allows machines to work with text more easily is tokenization. Tokens are strings with a known meaning, usually representing a word. Tokenization turns words into tokens, which are then converted to numbers. One approach to tokenization is to use a pipeline:

![alt text](assets/tokenization-pipeline.gif)

1. Start with the text you want to tokenize.
2. Split the words in the text based on a rule. For example, split the words where there's a white space.
3. Stemming. Merge similar words by removing the end of a word.
4. Stop word removal. Remove noisy words that have little meaning, like *the* and *a*. A dictionary of these words is provided to structurally remove them from the text.
5. Assign a number to each unique token.

Tokenization allowed for text to be labeled. As a result, statistical techniques could be used to let computers find patterns in the data instead of applying rule-based models.
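The pipeline steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the suffix list in `stem` and the `STOP_WORDS` set are hypothetical stand-ins for the much larger rule sets and dictionaries real pipelines use.

```python
import re

# Hypothetical stop-word dictionary; real pipelines use much larger lists.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is"}

def stem(word: str) -> str:
    """Very naive stemmer: strip one common English suffix, if present."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text: str) -> list[str]:
    # Step 2: lowercase, drop punctuation, and split on white space.
    words = re.sub(r"[^\w\s]", "", text.lower()).split()
    # Step 3: stemming merges similar words by removing word endings.
    stemmed = [stem(w) for w in words]
    # Step 4: stop word removal drops words that carry little meaning.
    return [w for w in stemmed if w not in STOP_WORDS]

def build_vocabulary(tokens: list[str]) -> dict[str, int]:
    # Step 5: assign a number to each unique token, in order of first appearance.
    vocab: dict[str, int] = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab

text = "The painter painted a painting and the paintings inspired painters"
tokens = tokenize(text)
vocab = build_vocabulary(tokens)
ids = [vocab[t] for t in tokens]
print(tokens)
print(ids)
```

Note how the naive stemmer maps *painted* and *painting* to the same token, so both receive the same number.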

## Statistical techniques for NLP
Two important statistical techniques that advanced NLP are Naïve Bayes and Term Frequency - Inverse Document Frequency (TF-IDF).


### Understanding Naïve Bayes
Naïve Bayes is a statistical technique that was first used for email filtering. To learn the difference between spam and not spam, two documents are compared. Naïve Bayes classifiers identify which tokens are correlated with emails labeled as spam. In other words, the technique finds which group of words only occurs in one type of document and not in the other. The group of words is often referred to as bag-of-words features.

For example, the phrases *miracle cure*, *lose weight fast*, and *anti-aging* may appear more frequently in spam emails about dubious health products than in your regular emails.

Though Naïve Bayes proved to be more effective than simple rule-based models for text classification, it was still relatively rudimentary as only the presence (and not the position) of a word or token was considered.
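To make the idea concrete, here is a minimal from-scratch sketch of a Naïve Bayes spam classifier over bag-of-words features. The training emails are invented for illustration, and the use of log probabilities and add-one (Laplace) smoothing is a common choice assumed here rather than something prescribed above.

```python
import math
from collections import Counter

# Tiny hypothetical training set of (tokens, label) pairs.
TRAINING = [
    ("miracle cure lose weight fast".split(), "spam"),
    ("anti-aging miracle cream buy now".split(), "spam"),
    ("meeting agenda for monday".split(), "not spam"),
    ("lunch on monday to discuss the project".split(), "not spam"),
]

def train(examples):
    """Count how often each token occurs per label (bag of words)."""
    word_counts = {}
    doc_counts = Counter()
    for tokens, label in examples:
        word_counts.setdefault(label, Counter()).update(tokens)
        doc_counts[label] += 1
    vocab = set()
    for counts in word_counts.values():
        vocab |= set(counts)
    return word_counts, doc_counts, vocab

def predict(tokens, word_counts, doc_counts, vocab):
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        # Log prior: how common the label is overall.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(counts.values())
        for token in tokens:
            # Add-one smoothing avoids zero probabilities for unseen tokens.
            score += math.log((counts[token] + 1) / (total_words + len(vocab)))
        scores[label] = score
    # Only token presence and frequency matter; word order is ignored.
    return max(scores, key=scores.get)

model = train(TRAINING)
print(predict("miracle weight cure".split(), *model))
print(predict("project meeting monday".split(), *model))
```

Because the score is a sum over tokens, shuffling the words of an email doesn't change the prediction, which is the limitation described above.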

### Understanding TF-IDF
The Term Frequency - Inverse Document Frequency (TF-IDF) technique had a similar approach in that it compared the frequency of a word in one document with the frequency of the word in a whole corpus of documents. By understanding in which context a word was being used, documents could be classified based on certain topics. TF-IDF is often used for information retrieval, to help understand which related words or tokens to search for.

>In the context of NLP, a corpus refers to a large and structured collection of text documents that is used for machine learning tasks. Corpora (plural of corpus) serve as essential resources for training, testing, and evaluating various NLP models.

For example, the word flour may often occur in documents that include recipes for baking. If searching for documents with flour, documents that include baking can also be retrieved as the words are often used together in a text.

TF-IDF proved to be useful for search engines in understanding a document's relevance to someone's search query. However, the TF-IDF technique doesn't take the semantic relationship between words into consideration. Synonyms or words with similar meanings aren't detected.
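The following sketch computes TF-IDF scores for a tiny, made-up corpus. It assumes one common variant of the formula (term frequency as count divided by document length, inverse document frequency as the logarithm of the number of documents divided by the number of documents containing the term); libraries often use slightly different weightings.

```python
import math
from collections import Counter

# Hypothetical mini-corpus of already tokenized documents.
CORPUS = [
    "mix flour sugar and butter then bake".split(),
    "bake the bread with flour and yeast".split(),
    "the match ended with a late goal".split(),
]

def tf_idf(corpus):
    n_docs = len(corpus)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        counts = Counter(doc)
        scores.append({
            # Term frequency in this document times inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

scores = tf_idf(CORPUS)
for doc_scores in scores:
    # Show the three highest-scoring terms per document.
    print(sorted(doc_scores, key=doc_scores.get, reverse=True)[:3])
```

Terms that occur in only one document, such as *sugar*, score higher than terms shared across documents, such as *flour*. The scores say nothing about meaning, so a synonym of *flour* would be treated as an unrelated term.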

Though statistical techniques were valuable developments in the field of NLP, deep learning techniques created the necessary innovations to accomplish the level of NLP we have today.


# Understand the deep learning techniques used for NLP

Statistical techniques were relatively good at Natural Language Processing (NLP) tasks like text classification. For tasks like translation, there was still much room for improvement.

A recent technique that has advanced the field of Natural Language Processing (NLP) for tasks like translation is deep learning.

When you want to translate text, you shouldn't just translate each word to another language. You may remember translation services from years ago that translated sentences too literally, often with odd results. Instead, you want a language model to understand the meaning (or semantics) of a text, and use that information to create a grammatically correct sentence in the target language.

## Understand word embeddings
One of the key concepts introduced by applying deep learning techniques to NLP is word embeddings. Before word embeddings, a prevailing challenge with NLP was detecting the semantic relationship between words. Word embeddings represent words in a vector space, so that the relationship between words can be easily described and calculated.

Word embeddings are created during self-supervised learning. During the training process, the model analyzes the co-occurrence patterns of words in sentences and learns to represent them as vectors. The vectors represent the words with coordinates in a multidimensional space. The distance between words can then be calculated by determining the distance between their respective vectors, describing the semantic relationship between words.

Imagine you train a model on a large corpus of text data. During the training process, the model finds that the words bike and car are often used in the same patterns of words. Next to finding bike and car in the same text, you can also find each of them to be used when describing similar things. For example, someone may drive a bike or a car, or buy a bike or a car at a shop.

The model learns that the two words are often found in similar contexts and therefore plots the word vectors for bike and car close to each other in the vector space.

Imagine we have a three-dimensional vector space where each dimension corresponds to a semantic feature. In this case, let's say the dimensions represent factors like vehicle type, mode of transportation, and activity. We can then assign hypothetical vectors to the words based on their semantic relationships:


![alt text](assets/word-embeddings-vectors.png)

1. Boat [2, 1, 4] is close to drive and shop, reflecting that you can drive a boat and visit shops near bodies of water.
2. Car [7, 5, 1] is closer to bike than boat as cars and bikes are both used on land rather than on water.
3. Bike [6, 8, 0] is closer to drive in the activity dimension and close to car in the vehicle type dimension.
4. Drive [8, 4, 3] is close to boat, car and bike, but far from shop as it describes a different kind of activity.
5. Shop [1, 3, 5] is closest to bike as these words are most commonly used together.

>In the example, a three-dimensional space is used to describe word embeddings and vector spaces in simple terms. Vector spaces are often high-dimensional, with vectors representing a position in that space, similar to coordinates in a two-dimensional plane.
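As a small illustration of how distances between embeddings can be calculated, the sketch below uses the hypothetical vectors for boat, car, and bike from the list above. Euclidean distance and cosine similarity are two common measures; which one a given model or library uses is an assumption you would need to check.

```python
import math

# Hypothetical three-dimensional embeddings taken from the example above.
EMBEDDINGS = {
    "boat": [2, 1, 4],
    "car": [7, 5, 1],
    "bike": [6, 8, 0],
}

def euclidean_distance(a, b):
    """Straight-line distance: smaller means the words are closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between vectors: closer to 1 means more aligned."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

car_bike = euclidean_distance(EMBEDDINGS["car"], EMBEDDINGS["bike"])
car_boat = euclidean_distance(EMBEDDINGS["car"], EMBEDDINGS["boat"])
print(f"car-bike distance: {car_bike:.2f}")
print(f"car-boat distance: {car_boat:.2f}")
print(f"car-bike cosine similarity: {cosine_similarity(EMBEDDINGS['car'], EMBEDDINGS['bike']):.2f}")
print(f"car-boat cosine similarity: {cosine_similarity(EMBEDDINGS['car'], EMBEDDINGS['boat']):.2f}")
```

With these illustrative coordinates, both measures place car nearer to bike than to boat.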

Though word embeddings are a great approach to detecting the semantic relationship between words, they still have their problems. For example, words with different intents like love and hate often appear related because they're used in similar contexts. Another problem was that the model would only use one entry per word, resulting in a word with different meanings like bank being semantically related to a wide array of words.


## Adding memory to NLP models
To understand text isn't just to understand individual words, presented in isolation. Words can differ in their meaning depending on the context they're presented in. In other words, the sentence around a word matters to the meaning of the word.

## Using RNNs to include the context of a word
Before deep learning, including the context of a word was a task too complex and costly. One of the first breakthroughs in including the context was the Recurrent Neural Network (RNN).

RNNs consist of multiple sequential steps. Each step takes an input and a hidden state. Imagine the input at each step to be a new word. Each step also produces an output. The hidden state can serve as a memory of the network, storing the output of the previous step and passing it as input to the next step.
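The step-by-step update described above can be sketched as a single function that combines the current input with the previous hidden state. This is a minimal sketch of a simple (Elman-style) RNN step with random, untrained weights; the sizes, the weight names `W_xh` and `W_hh`, and the `tanh` activation are assumptions for illustration, and the output layer is omitted.

```python
import math
import random

random.seed(0)
INPUT_SIZE = 3
HIDDEN_SIZE = 4

def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# Untrained, randomly initialized weights (in practice these are learned).
W_xh = random_matrix(HIDDEN_SIZE, INPUT_SIZE)   # input -> hidden
W_hh = random_matrix(HIDDEN_SIZE, HIDDEN_SIZE)  # previous hidden -> hidden

def rnn_step(x, h_prev):
    """One step: mix the current input with the memory from earlier steps."""
    h_new = []
    for i in range(HIDDEN_SIZE):
        total = sum(W_xh[i][j] * x[j] for j in range(INPUT_SIZE))
        total += sum(W_hh[i][j] * h_prev[j] for j in range(HIDDEN_SIZE))
        h_new.append(math.tanh(total))
    return h_new

# Hypothetical token vectors standing in for three embedded words.
sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

h = [0.0] * HIDDEN_SIZE  # the hidden state starts empty
for x in sequence:
    h = rnn_step(x, h)   # the hidden state is overwritten at every step
print(h)
```

Because the hidden state has a fixed size and is rewritten at every step, information from early tokens is blended with everything that follows.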

Imagine a sentence like:

*Vincent Van Gogh was a painter most known for creating stunning and emotionally expressive artworks, including ...*

To know what word comes next, you need to remember the name of the painter. The sentence needs to be completed, as the last word is still missing. A missing or masked word in NLP tasks is often represented with [MASK]. By using the special [MASK] token in a sentence, you can let a language model know it needs to predict what the missing token or value is.

Simplifying the example sentence, you can provide the following input to an RNN: Vincent was a painter known for [MASK]:

![alt text](assets/vincent-tokenized.png)

The RNN takes each token as an input, processes it, and updates the hidden state with a memory of that token. When the next token is processed as new input, the hidden state from the previous step is updated.

Finally, the last token is presented as input to the model, namely the [MASK] token, indicating that there's information missing and the model needs to predict its value. The RNN then uses the hidden state to predict that the output should be something like Starry Night.

![alt text](assets/recurrent-network.gif)

In the example, the hidden state contains the information Vincent, is, painter, and know. With RNNs, each of these tokens is equally important in the hidden state, and therefore equally considered when predicting the output.

RNNs allow for context to be included when deciphering the meaning of a word in relation to the complete sentence. However, as the hidden state of an RNN is updated with each token, the actual relevant information, or signal, may be lost.

In the example provided, Vincent Van Gogh's name is at the start of the sentence, while the mask is at the end. At the final step, when the mask is presented as input, the hidden state may contain a large amount of information that is irrelevant for predicting the mask's output. Since the hidden state has a limited size, the relevant information may even be deleted to make room for new and more recent information.

When we read this sentence, we know that only certain words are essential to predict the last word. An RNN, however, includes all (relevant and irrelevant) information in a hidden state. As a result, the relevant information may become a weak signal in the hidden state, meaning that it can be overlooked because there's too much other irrelevant information influencing the model.

## Improving RNNs with Long Short-Term Memory
One solution to the weak signal problem with RNNs is a newer type of RNN: Long Short-Term Memory (LSTM). LSTM is able to process sequential data by maintaining a hidden state that is updated at each step. With LSTM, the model can decide what to remember and what to forget. By doing so, context that isn't relevant or doesn't provide valuable information can be skipped, and important signals can be persisted longer.
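The idea of deciding what to remember and what to forget can be illustrated with a heavily simplified LSTM cell that works on single numbers instead of vectors. The gate names follow the standard LSTM formulation; the weights below are hypothetical and hand-picked to make the effect visible, whereas a real model learns them during training.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell. `w` maps gate name -> (w_x, w_h, bias)."""
    def gate(name, activation):
        w_x, w_h, b = w[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much new information to write
    g = gate("candidate", math.tanh)   # the new information itself
    o = gate("output", sigmoid)        # how much of the cell state to expose
    c = f * c_prev + i * g             # updated long-term memory
    h = o * math.tanh(c)               # updated hidden state
    return h, c

# Hypothetical weights: the forget gate stays close to 1 (keep memory),
# and the input gate only opens when the input is non-zero.
WEIGHTS = {
    "forget": (0.0, 0.0, 6.0),
    "input": (4.0, 0.0, -2.0),
    "candidate": (2.0, 0.0, 0.0),
    "output": (0.0, 0.0, 6.0),
}

h, c = 0.0, 0.0
for x in [1.0, 0.0, 0.0, 0.0, 0.0]:  # one signal, then uninformative inputs
    h, c = lstm_step(x, h, c, WEIGHTS)
print(f"cell state after five steps: {c:.3f}")
```

With a forget gate near 1, the signal from the first step is still largely present in the cell state after several uninformative steps. Lowering the forget gate's bias makes the cell discard it quickly.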

# Understand the transformer architecture used for NLP
The latest breakthrough in NLP is owed to the development of the Transformer architecture.

Transformers were introduced in the [Attention is all you need paper by Vaswani, et al. from 2017](https://arxiv.org/abs/1706.03762). The Transformer architecture provides an alternative to Recurrent Neural Networks (RNNs) for NLP. Whereas RNNs are compute-intensive because they process words sequentially, Transformers process the words in a sequence in parallel by using attention.

The position of a word and the order of words in a sentence are important to understand the meaning of a text. To include this information, without having to process text sequentially, Transformers use positional encoding.

## Understand positional encoding
Before Transformers, language models used word embeddings to encode text into vectors. In the Transformer architecture, positional encoding is used to encode text into vectors. Positional encoding is the sum of word embedding vectors and positional vectors. By doing so, the encoded text includes information about the meaning and position of a word in a sentence.

To encode the position of a word in a sentence, you could use a single number to represent the index value. For example:

|Token |	Index value |
| -- | -- |
|The |	0 |
|work |	1 |
|of |	2 |
|William |	3 |
|Shakespeare |	4 |
|inspired |	5 |
|many |	6 |
|movies |	7 |
|... |	... |

The longer a text or sequence, the larger the index values may become. Though using unique values for each position in a text is a simple approach, the values would hold no meaning, and the growing values may create instability during model training.

The solution proposed in the Attention is all you need paper uses sine and cosine functions, where pos is the position, i is the dimension index, and d_model is the size of the embedding vectors:

$ PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}}) $

$ PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}}) $

When you use these periodic functions together, you can create unique vectors for each position. As a result, the values stay within a fixed range and don't grow when a longer text is encoded. Also, these positional vectors make it easier for the model to calculate and compare the positions of different words in a sentence against each other.
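The sine and cosine scheme can be sketched as follows, using the base of 10000 from the Attention is all you need paper. The function name and the small sizes are chosen for illustration only; in a Transformer, each resulting row is added to the word embedding of the token at that position.

```python
import math

def positional_encoding(num_positions: int, d_model: int) -> list[list[float]]:
    """Return one d_model-sized positional vector per position."""
    encoding = []
    for pos in range(num_positions):
        row = []
        for dim in range(d_model):
            i = dim // 2  # dimensions 2i and 2i+1 share the same frequency
            angle = pos / (10000 ** (2 * i / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        encoding.append(row)
    return encoding

pe = positional_encoding(num_positions=8, d_model=4)
for pos, row in enumerate(pe):
    print(pos, [round(v, 3) for v in row])
```

Every value stays between -1 and 1 regardless of how long the sequence is, while each position still receives its own distinct vector.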

## Understand multi-head attention
The most important technique used by Transformers to process text is the use of attention instead of recurrence.

Attention (also referred to as self-attention or intra-attention) is a mechanism used to map new information to learned information in order to understand what the new information entails.

Transformers use an attention function, where a new word is encoded (using positional encoding) and represented as a query. The output of an encoded word is a key with an associated value.

To illustrate the three variables used by the attention function (the query, keys, and values), let's explore a simplified example. Imagine encoding the sentence *Vincent van Gogh is a painter, known for his stunning and emotionally expressive artworks*. When encoding the query *Vincent van Gogh*, the output may be *Vincent van Gogh* as the key with *painter* as the associated value. The architecture stores keys and values in a table, which it can then use for future decoding:

| Keys | Values |
|-|-|
| Vincent Van Gogh | Painter |
| William Shakespeare | Playwright |
| Charles Dickens | Writer |

Whenever a new sentence is presented, like *Shakespeare's work has influenced many movies, mostly thanks to his work as a ...*, the model can complete the sentence by taking *Shakespeare* as the query and finding it in the table of keys and values. The query *Shakespeare* is closest to the key *William Shakespeare*, and thus the associated value *playwright* is presented as the output.

## Using the scaled dot-product to compute the attention function

To calculate the attention function, the query, keys, and values are all encoded to vectors. The attention function then computes the scaled dot-product between the query vector and the key vectors.

$ Attention(Q,K,V) = softmax(\frac{QK^T}{\sqrt{d_k}})V $

The dot-product measures how closely the vectors representing tokens are aligned: the more two vectors point in the same direction, the larger the product.

The softmax function is used within the attention function, over the scaled dot-product of the vectors, to create a probability distribution over the keys. In other words, the softmax function's output indicates which keys are closest to the query. In the simplified picture, the key with the highest probability is selected, and the associated value is the output of the attention function. In practice, the output is a weighted sum of all values, with the softmax probabilities as the weights.
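The attention formula can be sketched for a single query as follows. The two-dimensional key vectors and one-hot value vectors are hypothetical stand-ins for the entries in the key-value table above; real models use learned, much higher-dimensional vectors and process all queries at once as matrices.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(query)
    # Dot product of the query with every key, scaled by sqrt(d_k).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    # The output is the weighted sum of the value vectors.
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return output, weights

# Hypothetical vectors for the keys and values from the table above.
LABELS = ["painter", "playwright", "writer"]
KEYS = [[4.0, 0.0], [0.0, 4.0], [2.0, 2.0]]  # Van Gogh, Shakespeare, Dickens
VALUES = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

query = [0.0, 4.0]  # hypothetical encoding of "Shakespeare"
output, weights = attention(query, KEYS, VALUES)
best = LABELS[weights.index(max(weights))]
print([round(w, 4) for w in weights])
print(best)
```

Almost all of the weight lands on the key that is most aligned with the query, so the output is dominated by that key's value.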

The Transformer architecture uses multi-head attention, which means tokens are processed by the attention function several times in parallel. By doing so, a word or sentence can be processed multiple times, in various ways, to extract different kinds of information from the sentence.
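A rough sketch of the multi-head idea is shown below. It is deliberately simplified: each head here just looks at a different slice of the vector dimensions, whereas the heads in an actual Transformer apply their own learned projection matrices to the queries, keys, and values and a final learned projection to the concatenated result. The token vectors are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query; returns the weighted values."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]

def split_heads(vector, num_heads):
    """Cut a vector into equally sized slices, one per head."""
    size = len(vector) // num_heads
    return [vector[h * size:(h + 1) * size] for h in range(num_heads)]

def multi_head_attention(query, keys, values, num_heads):
    q_heads = split_heads(query, num_heads)
    k_heads = [split_heads(k, num_heads) for k in keys]
    v_heads = [split_heads(v, num_heads) for v in values]
    output = []
    for h in range(num_heads):
        # Each head attends independently over its own slice.
        head_out = attention(q_heads[h], [k[h] for k in k_heads], [v[h] for v in v_heads])
        output.extend(head_out)  # concatenate the head outputs
    return output

# Hypothetical encoded tokens; self-attention uses them as queries, keys, and values.
tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
out = multi_head_attention(tokens[0], tokens, tokens, num_heads=2)
print([round(v, 3) for v in out])
```

Because each head computes its own attention weights, different heads can focus on different tokens for the same query.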

## Explore the Transformer architecture
In the Attention is all you need paper, the proposed Transformer architecture is modeled as:


![alt text](assets/transformer-architecture.jpg)

There are two main components in the original Transformer architecture:

- The encoder: Responsible for processing the input sequence and creating a representation that captures the context of each token.
- The decoder: Generates the output sequence by attending to the encoder's representation and predicting the next token in the sequence.

The most important innovations presented in the Transformer architecture were positional encoding and multi-head attention. A simplified representation of the architecture, focusing on these two components, may look like:

![alt text](assets/simplified-transformer-architecture.png)

- In the encoder layer, an input sequence is encoded with positional encoding, after which multi-head attention is used to create a representation of the text.
- In the decoder layer, an (incomplete) output sequence is encoded in a similar way, by first using positional encoding and then multi-head attention. The multi-head attention mechanism is then used a second time within the decoder to combine the output of the encoder with the encoded output sequence that was passed as input to the decoder. As a result, the output can be generated.

The Transformer architecture introduced concepts that drastically improved a model's ability to understand and generate text. Different models have been trained using adaptations of the Transformer architecture to optimize for specific NLP tasks.


# Explore foundation models in the model catalog
The Transformer architecture has allowed us to train models for Natural Language Processing (NLP) in a more efficient way. Instead of processing each token in a sentence or sequence one at a time, attention allows a model to process tokens in parallel in various ways.

To train a model using the Transformer architecture, you need to use a large amount of text data as input. Different models have been trained, which mostly differ by the data they've been trained on, or by how they implement attention within their architectures. Since the models are trained on large datasets, and the models themselves are large in size, they're often referred to as Large Language Models (LLMs).

Many LLMs are open-source and publicly available through communities like Hugging Face. Azure also offers the most commonly used LLMs as foundation models in the Azure Machine Learning model catalog. Foundation models are pretrained on large texts and can be fine-tuned for specific tasks with a relatively small dataset.

## Explore the model catalog
In the Azure Machine Learning studio, you can navigate to the model catalog to explore all available foundation models. Additionally, you can import any model from the Hugging Face open-source library into the model catalog.

>Hugging Face is an open-source community making models available to the public. You can find all models in their [catalog](https://huggingface.co/models). Additionally, you can explore the documentation to learn more about how individual models work, like [BERT](https://huggingface.co/docs/transformers/main/model_doc/bert).

![alt text](assets/model-catalog.png)

The Azure Machine Learning model catalog integrates with models from Hugging Face and other sources. The Azure Machine Learning model catalog makes it easier to explore, test, fine-tune, and deploy models.

## Explore foundation models
When you select a model from the Azure Machine Learning catalog, you can experiment with it to explore whether it meets your requirements. A foundation model is already pretrained and you can deploy a foundation model to an endpoint without any extra training. If you want the model to be specialized in a task, or perform better on domain-specific knowledge, you can also choose to fine-tune a foundation model.

Foundation models can be used for various tasks, including:

- Text classification
- Token classification
- Question answering
- Summarization
- Translation

To choose the foundation model that best fits your needs, you can easily test out different models in the model catalog. You can also review the data the models are trained on and possible biases and risks a model may have.

Some foundation models that are commonly used are:

| Model| Description |
|-|-|
| [BERT](https://huggingface.co/docs/transformers/main/model_doc/bert) (Bidirectional Encoder Representations from Transformers) | Focused on encoding information by using context from before and after a token (bidirectional). Commonly used when you want to fine-tune a model to perform a specific task like text classification and question answering. |
| [GPT](https://huggingface.co/docs/transformers/main/model_doc/openai-gpt) (Generative Pretrained Transformer)                        | Trained to create coherent and contextually relevant text, and is most commonly used for tasks like text generation and chat completions.                                    |
| [LLaMA](https://huggingface.co/docs/transformers/main/model_doc/llama) (Large Language Model Meta AI)                           | A family of models created by Meta. When training LLaMA models, the focus has been on providing more training data rather than increasing the complexity of the models. You can use LLaMA models for text generation and chat completions. |
| [T5](https://huggingface.co/docs/transformers/main/model_doc/t5) (Text-to-Text Transfer Transformer)                          | An encoder-decoder model that uses a text-to-text approach. By focusing on converting text-to-text, these types of models are ideal for translation.                       |
