# Understanding Large Language Models
* Traditional methods excelled at categorization tasks such as email spam classification and straightforward pattern recognition that could be captured with handcrafted rules or simpler models. However, they typically underperformed in language tasks that demanded complex understanding and generation abilities.
* When we say language models "understand," we mean that they can process and generate text in ways that appear coherent and contextually relevant, not that they possess human-like consciousness or comprehension.
* The success of LLMs can be attributed to the transformer architecture, which underpins many LLMs, and to the vast amounts of data they are trained on, which allow them to capture a wide variety of linguistic nuances, contexts, and patterns that would be challenging to encode manually.

## What is an LLM?
* An LLM, a large language model, is a neural network designed to understand, generate, and respond to human-like text. These models are deep neural networks trained on massive amounts of text data.
* The "large" in large language model refers to both the model's size in terms of parameters and the immense dataset on which it's trained.
* LLMs are trained with next-word prediction, which is sensible because it harnesses the inherent sequential nature of language to teach models about context, structure, and relationships within text.

## Applications of LLMs
* Today, LLMs are employed for machine translation, generation of novel texts, sentiment analysis, text summarization, and many other tasks. LLMs have recently been used for content creation, such as writing fiction, articles, and even computer code.
* LLMs can also power sophisticated chatbots and virtual assistants, such as OpenAI's ChatGPT or Google's Gemini. LLMs may also be used for effective knowledge retrieval from vast volumes of text in specialized areas such as medicine or law.

## Stages of building and using LLMs
* Coding an LLM from the ground up is an excellent exercise to understand its mechanics and limitations. It also equips us with the knowledge required for pretraining or finetuning existing open-source LLM architectures on our own domain-specific datasets or tasks.
* The general process of creating an LLM includes pretraining and finetuning.
  * The term "pre" in "pretraining" refers to the initial phase where a model like an LLM is trained on a large, diverse dataset to develop a broad understanding of language.
  * Finetuning is a process where the model is specifically trained on a narrower dataset that is more specific to particular tasks or domains.
    * In instruction-finetuning, the labeled dataset consists of instruction and answer pairs, such as a query to translate a text accompanied by the correctly translated text.
    * In classification finetuning, the labeled dataset consists of texts and associated class labels, for example, emails associated with spam and non-spam labels.   
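
As a rough illustration of the two labeled dataset shapes described above, the sketch below uses plain Python dataclasses. The class names, field names, and example records are hypothetical and chosen only for illustration; real finetuning datasets come in many formats.

```python
from dataclasses import dataclass


@dataclass
class InstructionExample:
    """One instruction-finetuning record: an instruction and its expected answer."""
    instruction: str
    answer: str


@dataclass
class ClassificationExample:
    """One classification-finetuning record: a text and its class label."""
    text: str
    label: str


# Hypothetical records, for illustration only.
instruction_data = [
    InstructionExample(
        instruction="Translate into German: 'Good morning'",
        answer="Guten Morgen",
    ),
]

classification_data = [
    ClassificationExample(text="You won a free prize, click here!", label="spam"),
    ClassificationExample(text="Are we still meeting at 3 pm?", label="non-spam"),
]


def label_counts(examples):
    """Count how many classification examples carry each class label."""
    counts = {}
    for example in examples:
        counts[example.label] = counts.get(example.label, 0) + 1
    return counts


print(label_counts(classification_data))
```

The key difference is the target: in instruction-finetuning the label is free-form text, while in classification finetuning it is one of a fixed set of classes.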


## Using LLMs for different tasks
* Most modern LLMs rely on the transformer architecture, which is a deep neural network architecture introduced in the 2017 paper "Attention Is All You Need." The transformer architecture depicted in Figure 1.4 consists of two submodules, an encoder and a decoder.
  * The encoder module processes the input text and encodes it into a series of numerical representations or vectors that capture the contextual information of the input.
  * The decoder module takes these encoded vectors and generates the output text from them.
* A key component of transformers and LLMs is the self-attention mechanism, which allows the model to weigh the importance of different words or tokens in a sequence relative to each other.
* BERT is built upon the original transformer's encoder submodule. It specializes in masked word prediction, where the model predicts masked or hidden words in a given sentence. This unique training strategy equips BERT with strengths in text classification tasks.
* GPT focuses on the decoder portion of the original transformer architecture and is designed for tasks that require generating texts.
* GPT models are adept at both zero-shot and few-shot learning tasks. Zero-shot learning refers to the ability to generalize to completely unseen tasks without any prior specific examples. Few-shot learning, on the other hand, involves learning from a minimal number of examples that the user provides as input.
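
To make the self-attention idea above more concrete, here is a minimal sketch in pure Python. It is a simplified variant and not a full transformer layer: it assumes scaled dot-product scoring, uses the token vectors directly instead of trainable query, key, and value projections, and the embedding values are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def simple_self_attention(embeddings):
    """Return (attention_weights, context_vectors) for a list of token vectors.

    Simplification: no trainable projections; each token vector serves as
    query, key, and value at the same time.
    """
    dim = len(embeddings[0])
    all_weights = []
    context_vectors = []
    for query in embeddings:
        # Score the current token against every token in the sequence.
        scores = [dot(query, key) / math.sqrt(dim) for key in embeddings]
        weights = softmax(scores)
        # The context vector is the weighted sum of all token vectors.
        context = [
            sum(w * vec[i] for w, vec in zip(weights, embeddings))
            for i in range(dim)
        ]
        all_weights.append(weights)
        context_vectors.append(context)
    return all_weights, context_vectors


# Made-up 3-dimensional embeddings for a 3-token sequence.
tokens = ["your", "journey", "starts"]
embeddings = [
    [0.4, 0.1, 0.8],
    [0.5, 0.8, 0.6],
    [0.2, 0.9, 0.7],
]

weights, contexts = simple_self_attention(embeddings)
for token, row in zip(tokens, weights):
    print(token, [round(w, 3) for w in row])
```

Each row of `weights` shows how strongly one token attends to every token in the sequence, including itself; the rows sum to 1 because of the softmax.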

## Utilizing large datasets
* Only a fraction of the data in each dataset was used for training: a subset of 300 billion tokens in total, drawn from all datasets combined. Training therefore did not cover every piece of data available. Some datasets were only partially covered, while others may have been sampled multiple times to reach the total of 300 billion tokens.
* For context, consider the size of the CommonCrawl dataset, which alone consists of 410 billion tokens and requires about 570 GB of storage.
* Pretraining LLMs requires access to significant resources and is very expensive. For example, the GPT-3 pretraining cost is estimated to be $4.6 million in terms of cloud computing credits.
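
The sampling arithmetic described above can be sketched as follows. The 300 billion token budget and the 410 billion token size of CommonCrawl come from the notes; the other two datasets and all sampling weights are hypothetical values picked only to show how some datasets end up partially covered while others are repeated.

```python
TOTAL_TRAINING_TOKENS = 300_000_000_000  # 300 billion tokens, as stated in the text

# CommonCrawl's size comes from the text; the other datasets and all
# sampling weights are hypothetical values chosen for illustration.
datasets = {
    "CommonCrawl": {"tokens": 410_000_000_000, "weight": 0.60},
    "hypothetical_books": {"tokens": 60_000_000_000, "weight": 0.30},
    "hypothetical_encyclopedia": {"tokens": 3_000_000_000, "weight": 0.10},
}


def passes_over_dataset(dataset_tokens, weight, total_tokens=TOTAL_TRAINING_TOKENS):
    """Average number of times each token of a dataset is seen during training."""
    sampled_tokens = weight * total_tokens
    return sampled_tokens / dataset_tokens


for name, info in datasets.items():
    passes = passes_over_dataset(info["tokens"], info["weight"])
    coverage = "partially covered" if passes < 1 else "fully covered or repeated"
    print(f"{name}: {passes:.2f} passes ({coverage})")
```

A value below 1 means that only part of the dataset is seen during training, and a value above 1 means that its tokens are, on average, seen more than once.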


## A closer look at the GPT architecture
* GPT stands for Generative Pretrained Transformer and was originally introduced in the following paper:
  * Improving Language Understanding by Generative Pre-Training (2018) by Radford et al. from OpenAI, http://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
* The original model offered in ChatGPT was created by finetuning GPT-3 on a large instruction dataset using a method from OpenAI's InstructGPT paper.
* It is remarkable that GPT models, although pretrained only on a relatively simple next-word prediction task, can carry out other tasks such as spelling correction, classification, or language translation.
* The next-word prediction task is a form of self-supervised learning, which is a form of self-labeling. This means that we don't need to collect labels for the training data explicitly but can leverage the structure of the data itself: we can use the next word in a sentence or document as the label that the model is supposed to predict. 
* Since this next-word prediction task allows us to create labels "on the fly," it is possible to leverage massive unlabeled text datasets to train LLMs.
* The general GPT architecture is relatively simple: it is essentially just the decoder part of the original transformer, without the encoder.
* Since decoder-style models like GPT generate text by predicting text one word at a time, they are considered a type of autoregressive model. Autoregressive models incorporate their previous outputs as inputs for future predictions. Consequently, in GPT, each new word is chosen based on the sequence that precedes it, which improves coherence of the resulting text.
* The ability to perform tasks that the model wasn't explicitly trained to perform is called an "emergent behavior." This capability isn't explicitly taught during training but emerges as a natural consequence of the model's exposure to vast quantities of multilingual data in diverse contexts.
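
The two ideas above, creating labels "on the fly" and generating text autoregressively, can be illustrated with a toy sketch. It makes a strong simplifying assumption: a bigram count table stands in for the neural network, so each prediction depends only on the single previous word, whereas GPT conditions on the whole preceding sequence. The function names and the example sentence are hypothetical.

```python
from collections import Counter, defaultdict


def make_training_pairs(words):
    """Use each word as the input and the word that follows it as the label."""
    return list(zip(words[:-1], words[1:]))


def fit_bigram_counts(pairs):
    """Count how often each label follows each input word."""
    counts = defaultdict(Counter)
    for current_word, next_word in pairs:
        counts[current_word][next_word] += 1
    return counts


def generate(counts, start_word, max_new_words):
    """Autoregressive loop: each predicted word is appended and becomes the next input."""
    output = [start_word]
    for _ in range(max_new_words):
        followers = counts.get(output[-1])
        if not followers:
            break  # no known continuation for this word
        # Greedy choice: pick the most frequent follower.
        next_word = followers.most_common(1)[0][0]
        output.append(next_word)
    return output


# Unlabeled text is all we need; the labels come from the text itself.
text = "the cat sat on the mat and the cat sat on the sofa"
words = text.split()
pairs = make_training_pairs(words)
counts = fit_bigram_counts(pairs)

print(pairs[:3])
print(" ".join(generate(counts, "the", 4)))
```

No labels were collected by hand here: every (input, label) pair was derived from the raw text, which is what makes it possible to use massive unlabeled datasets.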


## Building a large language model
* First, in stage 1, we will learn about the fundamental data preprocessing steps and code the attention mechanism that is at the heart of every LLM.
* Next, in stage 2, we will learn how to code and pretrain a GPT-like LLM capable of generating new texts. We will also go over the fundamentals of evaluating LLMs, which is essential for developing capable NLP systems.
* The focus of stage 2 is on implementing training for educational purposes using a small dataset.
* Finally, in stage 3, we will take a pretrained LLM and finetune it to follow instructions such as answering queries or classifying texts -- the most common tasks in many real-world applications and research.
