# Attention, self-attention, transformers

Theory:
* [Encoder-Decoder](#encoder-decoder)
* [Attention](#attention)
* [Self-Attention](#self-attention)
* [Transformer](#transformer)
* [LLM: BERT, GPT, ...](#llm)
* [LLM model sizes](#llm-sizes)
* [Byte Pair Encoding](#bpe)
* [Text generation](#generation)

Code examples:
* [GPT model in PyTorch](1_pure_torch.ipynb)
* [GPT with Hugging Face modules](2_transformers.ipynb)
* [Pretrained models](3_pretrained.ipynb)
* [Fine-tuning a pretrained model](4_fine_tuning.ipynb)


<a name="encoder-decoder"></a>
## Encoder-Decoder

* [Lena Voita, Sequence to Sequence (seq2seq) and Attention, NLP Course](https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html)

![image.png](attachment:image.png)




<a name="attention"></a>
## Attention

[Attn: Illustrated Attention](https://towardsdatascience.com/attn-illustrated-attention-5ec4ad276ee3)

![image.png](attachment:image.png)


<a name="self-attention"></a>
## Self-Attention

[Illustrated: Self-Attention](https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a)

![image.png](attachment:image.png)
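
As a rough companion to the illustration above, here is a minimal sketch of single-head scaled dot-product self-attention in plain Python, following the formula softmax(QK^T / sqrt(d_k)) V from Vaswani et al. (2017). The function names (`softmax`, `matmul`, `self_attention`) and the toy weight matrices are assumptions made for this example; real implementations work on batched tensors and use multiple heads.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention: every token attends to every token."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]          # transpose of K
    scores = matmul(q, k_t)                       # scores[i][j] = q_i . k_j
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights            # weighted sum of values

# Toy input: 3 tokens with 2-dimensional embeddings, identity projections.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]
output, weights = self_attention(x, eye, eye, eye)
for row in weights:
    print([round(w, 3) for w in row])             # each row sums to 1
```

Each row of `weights` shows how strongly one token attends to the others; the output row for that token is the corresponding weighted average of the value vectors.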



<a name="transformer"></a>
## Transformer

[Vaswani et al. (2017) Attention Is All You Need, NIPS](https://arxiv.org/pdf/1706.03762.pdf) 

![image.png](attachment:image.png)


<a name="llm"></a>
## LLM: BERT, GPT, ...

[Devlin et al. (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805.pdf)

[BERT Google repo](https://github.com/google-research/bert)

[Radford et al. (2018) Improving language understanding by generative pre-training](https://www.mikecaptain.com/resources/pdf/GPT-1.pdf)

[GPT-2 OpenAI repo](https://github.com/openai/gpt-2)

[PyTorch transformers](https://pytorch.org/hub/huggingface_pytorch-transformers/)

![image.png](attachment:image.png)


<a name="llm-sizes"></a>
## LLM model sizes

![image.png](attachment:image.png)

[Amatriain et al. (2023) Transformer models: an introduction and catalog](https://amatriain.net/blog/transformer-models-an-introduction-and-catalog-2d1e9039f376/)

<a name="bpe"></a>
## Byte Pair Encoding

Suppose the data to be encoded is

 aaabdaaabac

The byte pair "aa" occurs most often, so it will be replaced by a byte that is not used in the data, such as "Z". Now there is the following data and replacement table:

 ZabdZabac\
 Z=aa

Then the process is repeated with byte pair "ab", replacing it with "Y":

 ZYdZYac\
 Y=ab\
 Z=aa

The only literal byte pair left occurs only once, and the encoding might stop here. Alternatively, the process could continue with recursive byte pair encoding, replacing "ZY" with "X":

 XdXac\
 X=ZY\
 Y=ab\
 Z=aa

This data cannot be compressed further by byte pair encoding because there are no pairs of bytes that occur more than once.

To decompress the data, simply perform the replacements in the reverse order.
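
The walkthrough above can be sketched in a few lines of Python. This is a minimal illustration, not a tokenizer implementation: the names `bpe_compress`, `bpe_decompress`, and `new_symbols` are made up for this example, the placeholder symbols are assumed not to occur in the input, and overlapping occurrences of a pair are counted naively. When two pairs are equally frequent (as "Za" and "ab" are in the second step), the sketch picks the lexicographically larger one. That tie-break is an arbitrary assumption, chosen here because it happens to reproduce the replacement table above.

```python
from collections import Counter

def bpe_compress(data, new_symbols="ZYXWVU"):
    """Repeatedly replace the most frequent adjacent pair with an unused symbol."""
    seq = list(data)
    table = []  # (symbol, pair), in the order the replacements were made
    for symbol in new_symbols:
        counts = Counter(zip(seq, seq[1:]))
        if not counts:
            break
        # Tie-break (arbitrary assumption): prefer the lexicographically larger pair.
        pair, freq = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < 2:
            break  # no pair occurs more than once, nothing left to compress
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                merged.append(symbol)  # replace the pair with the new symbol
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
        table.append((symbol, "".join(pair)))
    return "".join(seq), table

def bpe_decompress(text, table):
    """Undo the replacements in reverse order."""
    for symbol, pair in reversed(table):
        text = text.replace(symbol, pair)
    return text

compressed, table = bpe_compress("aaabdaaabac")
print(compressed)                          # XdXac
print(table)                               # [('Z', 'aa'), ('Y', 'ab'), ('X', 'ZY')]
print(bpe_decompress(compressed, table))   # aaabdaaabac
```

Tokenizers used by language models apply the same merge idea to a training corpus and keep the learned merge table as a vocabulary; this sketch ignores those details.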

[Wikipedia: Byte pair encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding)

<a name="generation"></a>
## Text generation

[How to generate text](https://huggingface.co/blog/how-to-generate)

### Greedy search
![image.png](attachment:image.png)

### Beam search
![image-2.png](attachment:image-2.png)

### Sampling
![image-2.png](attachment:image-2.png)

### Temperature
![image-3.png](attachment:image-3.png)
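
A minimal sketch of what temperature does to the next-token distribution: the logits are divided by the temperature before the softmax, so values below 1 sharpen the distribution and values above 1 flatten it. The function name `softmax_with_temperature` and the toy logits are assumptions made for this example.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores to probabilities after scaling them by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                 # toy scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature approaches zero, sampling from this distribution behaves more and more like greedy search, because almost all probability mass ends up on the highest-scoring token.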

### Top-K sampling, Top-p sampling, and others
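
Both strategies restrict sampling to a subset of likely tokens: Top-K keeps the K most probable tokens, while Top-p (nucleus) sampling keeps the smallest set of most probable tokens whose cumulative probability reaches p. The sketch below illustrates the idea on a toy distribution; the function names (`top_k_filter`, `top_p_filter`, `sample`) and the probabilities are assumptions made for this example, not a library API.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}

def sample(filtered, rng=random):
    """Draw one token id from a filtered {token_id: probability} mapping."""
    ids = list(filtered)
    return rng.choices(ids, weights=[filtered[i] for i in ids], k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]           # toy next-token distribution
print(top_k_filter(probs, k=2))          # only token ids 0 and 1 remain
print(top_p_filter(probs, p=0.9))        # token ids 0, 1 and 2 remain
print(sample(top_p_filter(probs, p=0.9), random.Random(0)))
```

With Top-K the size of the candidate set is fixed, whereas with Top-p it grows or shrinks depending on how peaked the distribution is.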
